ClubReady Digital Insights Workshop

# Install easypackages if needed, then install/load all required packages
if(!require(easypackages)){install.packages("easypackages")}
library(easypackages)
packages("plyr", "dplyr", "data.table", "xda", "ggplot2", "forcats", "readr", "gridExtra", "knitr", "onehot",
         "tidyr", "purrr", "survival", prompt = FALSE)

Abstract

ClubReady asked Valorem to assist in the development of machine learning algorithms to predict user churn and store churn. After a thorough exploratory data analysis, the data provided sufficient information to develop an algorithm that appears to predict store churn with relatively high accuracy. However, the data does not provide sufficient information to predict user churn.

Introduction

This document details the processes and outcomes of a Valorem Digital Insight Workshop to develop churn prediction models.

Client Description

ClubReady LLC operates a Web-based club management platform for fitness facilities, individual fitness professionals, and large corporate chains. It specializes in member management, billing, EFT and POS, and sales process/CRM. The company was incorporated in 2014 and is based in St. Louis, Missouri.

Client Business Problem

ClubReady is the 3rd largest company in their industry. The company recognizes the need to leverage their data assets to gain deeper knowledge and understanding of customer behaviors. ClubReady has taken steps in this area to begin analyzing the significant data it captures through its membership software applications, and seeks support to further explore its data assets. ClubReady reached out to Valorem as a preferred Microsoft partner with expertise in Advanced Analytics, including Azure Machine Learning and the Cortana Intelligence Suite of Azure services.

ClubReady strives to become a data-first company focused on membership and member retention insights gained through an exploration of ClubReady data.

Client Engagement

On October 3, 2017, ClubReady entered into a statement of work with Valorem for a Digital Insight Workshop. The workshop provided the following deliverables:

  • Data Assessment and Quality Report
  • Digital Analytics Vision & Roadmap of Actionable Insights
  • Documented Key Priorities
  • Potential Digital Insights Program ROI
  • Recommended and Prioritized Analytics Actions
  • Analytics Roadmap and Estimates
  • Presentation of Next Steps

As documented later in the report, these deliverables were collapsed into fewer deliverables as agreed during the on-site workshop meeting.

Digital Insights Process

Per the SOW, the Digital Insights Workshop Project flow includes:

  1. Pre-Work and Agenda
  2. On-Site Workshop
  3. Analytics Run
  4. Present Findings

In this document, Analytics Run is renamed Exploratory Data Analysis.

Pre-Work and Agenda

On November 20, 2017, Valorem’s Project Manager led an introductory pre-workshop Skype meeting. Andy Sweet, the ClubReady CTO and Project Sponsor, participated along with Valorem Data Scientists. The meeting consisted of:

  • Team Introductions
  • Project Overview
  • Communications Plan
  • Q & A

The core project Team Members were identified:

Team Member      Company & Title
Andy Sweet       ClubReady CTO
Justin Trusty    ClubReady Data Architect
Lauren Crosby    ClubReady Director of Product Management
Matt Mercurio    ClubReady Director of Engineering
Brad Llewellyn   Valorem Data Scientist
Cliff Weaver     Valorem Data Scientist
Brian Roselli    Valorem Project Management

On-Site Workshop

The on-site workshop was held at ClubReady on November 28 - 29. The agenda presented at the workshop included:

  • Day 1
    • ClubReady vision, goals, priorities
    • What can machine learning do for you?
    • Problem statements
    • Redefine deliverables
    • Churn Example
    • Data Sources & Definitions
  • Day 2
    • Review previous day successes and misses
    • Data Exploration
    • Q&A
    • Time Permitting (Clients Choice)
    • Introduction to R

During the two-day workshop, the agenda was roughly followed. Regardless of the path taken, all of the workshop's objectives were accomplished:

  • Understanding ClubReady goals and priorities
  • Development of well-formed questions
  • Review and Modification to Deliverables and Format
  • Access to data
  • Selection of data for analysis

Understanding ClubReady Goals & Priorities

At the start of the on-site workshop, Andy Sweet provided an overview of ClubReady, its FY18 Goals and the mission to become a data-first company believing this will provide a strategic advantage over its competitors.

Well-Formed Question Development

The success of any data science project starts with a well-formed question; without one, the project is likely to fail. At a minimum, a well-formed question provides:

  • A statement of the problem or issue that needs to be solved.
  • Detailed description of the data available for analysis. This includes noting data that is not available.
  • A description of what ClubReady would like to be able to predict or categorize.
  • How ClubReady will consume the output from this project (assuming the data supports an algorithmic solution).

Two questions were identified:

  1. Identify which clubs (full service and DIY) are likely to fail based on a percentage drop in revenue. It was determined that an 80% drop in revenue compared to the previous month(s) is the comparison metric. Note: Clubs were analyzed as a whole and not separated into full service and DIY populations.
  2. Identify the individual customer propensity at the club level (the lowest organization in the hierarchy) to terminate their club agreement. (This excludes one-time users - i.e., not club members.) Member Churn is defined by an individual customer agreement termination with no renewal within 30 days.
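The two definitions above can be expressed directly against the data. Below is a minimal sketch (not the production SQL) using hypothetical data frames and illustrative column names (StoreId, Month, Revenue, TerminatedOn, RenewedOn are assumptions for this example):

```r
library(dplyr)

# Question 1: flag a store as churned when monthly revenue drops 80%
# or more versus the prior month.
store_rev <- data.frame(
  StoreId = c(1, 1, 2, 2),
  Month   = c("2017-10", "2017-11", "2017-10", "2017-11"),
  Revenue = c(10000, 1500, 8000, 7900)
)
store_churn <- store_rev %>%
  group_by(StoreId) %>%
  arrange(Month, .by_group = TRUE) %>%
  mutate(IsChurn = as.integer(Revenue <= 0.2 * lag(Revenue))) %>%
  ungroup()

# Question 2: flag a member as churned when an agreement terminates
# with no renewal starting within 30 days.
agreements <- data.frame(
  UserId       = c(10, 10, 11),
  TerminatedOn = as.Date(c("2017-06-01", NA, "2017-06-01")),
  RenewedOn    = as.Date(c("2017-06-15", NA, NA))
)
member_churn <- agreements %>%
  mutate(IsChurn = as.integer(!is.na(TerminatedOn) &
           (is.na(RenewedOn) | RenewedOn - TerminatedOn > 30)))
```

The first month of each store is NA by construction, since there is no prior month to compare against.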

Review and Modification to Deliverables and Format

The Statement of Work provided the following outputs:

  • Data Assessment and Quality Report
  • Digital Analytics Vision & Roadmap of Actionable Insights
  • Documented Key Priorities
  • Potential Digital Insights Program ROI
  • Recommended and Prioritized Analytics Actions
  • Analytics Roadmap and Estimates
  • Presentation of Next Steps

During the workshop, example documentation built with RMarkdown and saved in HTML format was presented. The ClubReady Team agreed this format was acceptable for all project reporting. (This document is the result of that agreement.)

Access to ClubReady Data

On the 2nd day of the workshop, Valorem was granted and verified access to ClubReady data.

Data Selection

ClubReady reviewed the tables and variables available in the ClubReady database (over 600 tables and 3,000+ fields). It was quickly realized that a majority of the variables were sparsely populated. The Team implemented custom SQL code to identify the sparsity of each variable. The Team then identified the initial set of candidate data ClubReady believed to be important to answer the well-formed questions. Note that the data available for analysis was a small subset of all the data in the database.
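The sparsity screen was implemented in SQL against the full database; an equivalent check for a data frame already in memory can be sketched in R. This sketch treats NA, the literal string "NULL", and empty strings as missing, as seen in the extracts later in this document:

```r
# Percent-missing per column, where NA, "NULL", and "" count as missing
sparsity <- function(df) {
  sapply(df, function(col) {
    mean(is.na(col) | col %in% c("NULL", "")) * 100
  })
}

# Example usage against the stores data, worst columns first:
# sort(sparsity(stores), decreasing = TRUE)
```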

Data Caveats & Assumptions

  • There was no method to identify when data was populated. For example, a column that is 60% dense overall would not have been selected even if it is 100% dense over the last year or two. There may therefore be useful data that was not used in this analysis.
  • There is no guarantee all impactful variables were identified. Reviewing a large dataset with a small Team does not ensure all important variables were included in the data used for modeling.
  • The Team decided the last 2 years of data would be used for the project. It was in this period that ClubReady experienced significant growth.

Exploratory Data Analysis

This section of the document presents the results of the Exploratory Data Analysis process.

This document section is organized as follows:

  • One of three CSV data files, stores.csv, is explored with detail in the first section
  • In the following sections, the other two data files (User and Class & Services) are explored albeit with less explanatory content
  • All of these sections roughly follow the same thought process:
    • Get Data
    • Remove/transform NAs
    • Character to Factors
    • Explore factors with tables and plots
    • Explore numerical data
    • Explore variances
    • Manage duplicate records
    • Data Glimpse

Stores

Get Data

Working with ClubReady, Valorem developed SQL scripts capturing raw ClubReady data in CSV format. The SQL code will be provided upon request.

The stores data returns 42322 records with 135 variables. Of the 135 variables, 15 are character type (summarized below) and the rest are numeric.

Categorical Variables

Examine the categorical variables first:

charSummary(stores)
##                                               n miss miss% unique
## Status                                    42322    0     0      3
## AverageMinsPerClass_Service               42322    0     0     83
## AveragePricePerClass_Service              42322    0     0    772
## AverageCancellationHrsPerClass_Service    42322    0     0     36
## AverageRescheduleDeadlinePerClass_Service 42322    0     0     40
## AverageMinsPerClass                       42322    0     0     76
## AveragePricePerClass                      42322    0     0    495
## AverageCancellationHrsPerClass            42322    0     0     33
## AverageRescheduleDeadlinePerClass         42322    0     0     35
## AverageMinsPerService                     42322    0     0     57
## AveragePricePerService                    42322    0     0    471
## AverageCancellationHrsPerService          42322    0     0     35
## AverageRescheduleDeadlinePerService       42322    0     0     42
## UserEmployeeRatio                         42322    0     0   1446
## UserEmployeeRatio_Last3Months             42322    0     0   1326
##                                                                                                  top5levels:count
## Status                                                                    Active:37262, Inactive:4397, Cancel:663
## AverageMinsPerClass_Service                                        NULL:17693, 45:3120, 60:2182, 50:1775, 30:1597
## AveragePricePerClass_Service              NULL:17693, 0.000000:3183, 45.000000:1087, 20.000000:531, 57.500000:293
## AverageCancellationHrsPerClass_Service                               NULL:19039, 24:9395, 0:2152, 1:1592, 12:1549
## AverageRescheduleDeadlinePerClass_Service                            NULL:19039, 24:7588, 1:2221, 3:1673, 12:1500
## AverageMinsPerClass                                                 NULL:22835, 60:5776, 45:1656, 30:1263, 52:972
## AveragePricePerClass                       NULL:22835, 0.000000:3212, 20.000000:811, 10.000000:684, 25.000000:522
## AverageCancellationHrsPerClass                                        NULL:25858, 24:7273, 0:2713, 1:1893, 12:863
## AverageRescheduleDeadlinePerClass                                     NULL:25858, 24:5249, 1:2587, 3:1741, 2:1401
## AverageMinsPerService                                              NULL:20946, 45:4126, 40:1968, 60:1851, 50:1823
## AveragePricePerService                    NULL:20922, 0.000000:2916, 45.000000:1674, 30.000000:457, 60.000000:445
## AverageCancellationHrsPerService                                    NULL:21008, 24:9905, 0:2002, 16:1703, 12:1655
## AverageRescheduleDeadlinePerService                                 NULL:21008, 24:8306, 12:1636, 16:1589, 1:1484
## UserEmployeeRatio                                                       NULL:3274, 18:382, 21:346, 28:342, 23:338
## UserEmployeeRatio_Last3Months                                            NULL:2923, 7:412, 21:393, 25:393, 16:389

From the summary, the only truly categorical variable is Status, a factor with three levels (Active, Inactive, Cancel); the other character columns hold numeric values stored as text, with the string NULL marking missing values.

stores <- stores %>% mutate_if(is.character, as.factor)
#charSummary(stores)
table(stores$Status)
## 
##   Active   Cancel Inactive 
##    37262      663     4397
ggplot(stores, aes(fct_infreq(Status))) + geom_bar() + xlab("ClubReady Status Code") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ClubReady provided direction on how to manage Active, Inactive, and Cancel:

Inactive is a store that is no longer with ClubReady. Most likely left to go to competitor. Cancel is a store that was in the process of being setup and for some reason stopped. Ignore cancel.

Remove records where Status is neither Active nor Inactive.

stores <-  filter(stores, Status == 'Active' | Status == 'Inactive')
table(stores$Status)
## 
##   Active   Cancel Inactive 
##    37262        0     4397
stores$Status <- factor(stores$Status)
table(stores$Status)
## 
##   Active Inactive 
##    37262     4397

After removing the records with a Cancel status, 41659 records remain.

Numerical Data

Review the numerical data. Pay attention to the missing-data counts shown in the right-most columns below. (Note: only the first 20 of the numerical variables are shown.)

myNumSum <- numSummary(stores)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
head(myNumSum, 20)
##                       Variable_Name     n nunique nzeros missCNT miss%
## 1                           StoreId 41659    2616      0       0     0
## 2                           IsChurn 41659       2  41078       0     0
## 3                   IsChurn_2Months 41659       2  40026       0     0
## 4                   IsChurn_3Months 41659       2  39602       0     0
## 5                   IsChurn_6Months 41659       2  38505       0     0
## 6                        SMSEnabled 41659       2      4       0     0
## 7                    StoreAgeMonths 41659     119    290       0     0
## 8                         Amenities 41659      28  30988       0     0
## 9            Integration_ABCCheckIn 41659       2  41224       0     0
## 10          Integration_ABCDataTrak 41659       2  41574       0     0
## 11               Integration_dotFit 41659       2  41259       0     0
## 12               Integration_FitBPO 41659       1  41659       0     0
## 13           Integration_fitRewards 41659       2  41252       0     0
## 14            Integration_Listen360 41659       2  30365       0     0
## 15           Integration_MotionSoft 41659       1  41659       0     0
## 16            Integration_Paramount 41659       1  41659       0     0
## 17            Integration_Perkville 41659       2  37633       0     0
## 18              Integration_PumpOne 41659       2  41634       0     0
## 19         Integration_ShareYourFit 41659       1  41659       0     0
## 20 Integration_VisualFitnessPlanner 41659       1  41659       0     0

Good, no missing data!

Examine the numerical data visually to illustrate interesting distributions.

cntNumNames <- length(select(select_if(stores,is.numeric), -StoreId))
#Make plots max 6 at a time - change if needed
maxPlot = 6
loopCnt <- cntNumNames %/% maxPlot
remainder <- cntNumNames %% maxPlot

myLoop_DF <- data.frame(x = seq(1, cntNumNames-remainder, by = maxPlot), y = seq(6, cntNumNames, by = maxPlot))
myLoopMax <- max(myLoop_DF)

for(i in 1:nrow(myLoop_DF)){
  myplot <- select(select_if(stores,is.numeric), -StoreId)[myLoop_DF[i,1]:myLoop_DF[i,2]]%>% gather() %>% ggplot(aes(value)) +
      facet_wrap(~ key, scales = "free") + geom_histogram() #+  geom_density()
  print(myplot)
}

Most of the plots above are not very interesting, but there are a few we may want to revisit individually, including Amenities, Forms, and NonRequiredForms.

p1 <- ggplot(filter(stores, Amenities > 0), aes(Amenities)) + geom_bar() + ggtitle("Amenities > 0")
p2 <- ggplot(filter(stores, Forms > 0), aes(Forms)) + geom_bar() + ggtitle("Forms > 0")
p3 <- ggplot(filter(stores, NonRequiredForms > 0), aes(NonRequiredForms)) + geom_bar() + ggtitle("NonRequiredForms > 0")
grid.arrange(p1, p2, p3, ncol=3)

Variability

Evaluate how much variability there is in each numerical variable.

# Col 1 is StoreId, col 20 is Status
myVariance <- as.data.frame(apply(stores[,-c(1,20)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <-  myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)
## 
##  No Yes 
##   7 110

Because 7 variables have no variance (all of their values are identical), they can be removed from the working dataset. If there are no differences within a column, it is of no use in developing an algorithm. The variables removed because they have no variance are:

VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
zeroVarNames <- myVariance %>% filter(Variance == 0) %>% select(rowname)
stores <- stores %>% select(StoreId, Status, unlist(VarNames))
zeroVarNames
##                                  rowname
## 1                     Integration_FitBPO
## 2                 Integration_MotionSoft
## 3                  Integration_Paramount
## 4               Integration_ShareYourFit
## 5       Integration_VisualFitnessPlanner
## 6 IntegrationType_LandingPages_Marketing
## 7            IntegrationType_SocialMedia

Outlier Detection

In the working dataset, there is one variable, TotalRevenue, to be evaluated for potential outliers. There are many ways to visualize outliers; boxplots are the most commonly used visualization.

out2 <- ggplot(stores, aes(x = "", y = TotalRevenue)) + geom_boxplot(outlier.color="red", outlier.shape=8, outlier.size=4) + 
  scale_y_continuous(labels = scales::dollar)

There are potential outliers. ClubReady may want to review the legitimacy of these values.
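Complementary to the boxplot, the points ggplot flags as outliers can be counted directly with the standard 1.5 × IQR rule. A minimal sketch (the function name is illustrative):

```r
# TRUE for values beyond 1.5 * IQR from the quartiles (boxplot rule)
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Example usage against the stores data:
# sum(iqr_outliers(stores$TotalRevenue), na.rm = TRUE)
```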

Here is a list of the highest 25 TotalRevenue records:

tmpRev <- arrange(stores, desc(TotalRevenue)) %>% select(TotalRevenue)
tmpRev <- as.data.frame(head(scales::dollar(tmpRev$TotalRevenue), 25))
names(tmpRev) <- "Total_Revenue"
tmpRev
##    Total_Revenue
## 1     $9,155,556
## 2     $9,107,405
## 3     $8,044,618
## 4     $7,917,987
## 5     $7,695,146
## 6     $7,595,443
## 7     $7,553,951
## 8     $7,528,441
## 9     $7,150,346
## 10    $7,049,678
## 11    $6,973,129
## 12    $6,755,486
## 13    $6,742,283
## 14    $6,614,312
## 15    $6,604,380
## 16    $6,453,092
## 17    $6,298,918
## 18    $6,294,541
## 19    $6,193,235
## 20    $6,076,802
## 21    $5,976,849
## 22    $5,541,476
## 23    $5,444,348
## 24    $4,954,580
## 25    $4,687,897

Duplicates

Lastly, check for duplicate records. In this case, none are found because each record has a unique StoreId value. Without this variable, 11 duplicates would be found. (Interesting in its own right.)

# Duplicate Records
cat("The number of duplicated rows is", nrow(stores) - nrow(unique(stores)))
## The number of duplicated rows is 0
# if(nrow(stores) - nrow(unique(stores)) > 0){
#   head(stores[duplicated(stores),])
#   stores <- stores[!duplicated(stores),]
# }
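The 11 duplicates mentioned above can be surfaced by dropping the unique key before checking. A minimal sketch (the helper name is illustrative; dplyr assumed loaded):

```r
# Count rows that duplicate another row once the key column is removed
count_dupes_excluding <- function(df, key) {
  sum(duplicated(dplyr::select(df, -dplyr::all_of(key))))
}

# Example usage against the stores data:
# count_dupes_excluding(stores, "StoreId")
```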

Data Overview

The working dataset has been initially scrubbed and evaluated. Take time to review and learn about the data.

The plots below illustrate how often many of the ClubReady configuration options are turned on.

First examine the variable names that start with Integration.

myColTotal <- as.data.frame(colSums(Filter(is.numeric, stores)))
myColTotal <- tibble::rownames_to_column(myColTotal)
names(myColTotal)[1] <- "Variable_Name"
names(myColTotal)[2] <- "Sum_of_Variable"
myColTotal <- filter(myColTotal, !Variable_Name %in% c("StoreId", "TotalRevenue", "Amenities"))
myColTotal$Variable_Name <- as.factor(myColTotal$Variable_Name)
myColTotal <- myColTotal %>% arrange(desc(Sum_of_Variable))

myColTotal_Int <- filter(myColTotal, Variable_Name %like% "Integration")
ggplot(myColTotal_Int, aes(x=Variable_Name, y=Sum_of_Variable)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Variable Name") + ylab("Number of Times Option Selected") + scale_x_discrete(limits= myColTotal_Int$Variable_Name)

Six options appear to be much more popular than the others:

  1. Listen360
  2. Surveys
  3. Club Management System
  4. Rewards Program
  5. Perkville
  6. Data Trak

Below we find that most checkins occur between 8-11AM and 4-7PM local time. (Need to confirm)

myColTotal_checkins <- filter(myColTotal, Variable_Name %like% "Checkins_H" )
myColTotal_checkins <- myColTotal_checkins[!grepl("Months", myColTotal_checkins$Variable_Name),]
myColTotal_checkins$Variable_Name <- plyr::revalue(myColTotal_checkins$Variable_Name, c("Checkins_Hour0to1" = "0100", 
            "Checkins_Hour1to2" = "0200", 
            "Checkins_Hour2to3" = "0300", "Checkins_Hour3to4" = "0400", "Checkins_Hour4to5" = "0500",
            "Checkins_Hour5to6" = "0600", "Checkins_Hour6to7" = "0700", "Checkins_Hour7to8" = "0800",
            "Checkins_Hour8to9" = "0900", "Checkins_Hour9to10" = "1000", "Checkins_Hour10to11" = "1100",
            "Checkins_Hour11to12" = "1200", "Checkins_Hour12to13" = "1300", "Checkins_Hour13to14" = "1400",
            "Checkins_Hour14to15" = "1500", "Checkins_Hour15to16" = "1600", "Checkins_Hour16to17" = "1700", 
            "Checkins_Hour17to18" = "1800", "Checkins_Hour18to19" = "1900", "Checkins_Hour19to20" = "2000",
            "Checkins_Hour20to21" = "2100", "Checkins_Hour21to22" = "2200", "Checkins_Hour22to23" = "2300",
            "Checkins_Hour23to0" = "2400"))

myColTotal_checkins$Variable_Name <- factor(myColTotal_checkins$Variable_Name, 
      levels = c("0100", "0200", "0300", "0400", "0500", "0600", "0700", "0800", "0900", "1000", "1100", "1200",
                "1300", "1400", "1500", "1600", "1700", "1800", "1900", "2000", "2100", "2200", "2300", "2400"))

ggplot(myColTotal_checkins, aes(x=Variable_Name, y=Sum_of_Variable)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Variable Name") + ylab("Number of Times Option Selected")

Lastly, Monday, Tuesday and Wednesday have the most checkins during the week.

myColTotal_dow <- myColTotal %>% filter(!Variable_Name %like% "Checkins_H", Variable_Name %like% "Checkins_")
myColTotal_dow$Variable_Name <- substring(myColTotal_dow$Variable_Name, 10, 10000)
myColTotal_dow$Variable_Name <- as.factor(myColTotal_dow$Variable_Name)

myColTotal_dow2 <- myColTotal_dow[!grepl("_", myColTotal_dow$Variable_Name),]
myColTotal_dow2_1 <- myColTotal_dow2[grepl("day", myColTotal_dow2$Variable_Name),]
myColTotal_dow2_1 <- myColTotal_dow2_1[!grepl("Weekday", myColTotal_dow2_1$Variable_Name),]
myColTotal_dow2_1$Variable_Name <- factor(myColTotal_dow2_1$Variable_Name, levels = c("Monday", "Tuesday", "Wednesday",
                                  "Thursday", "Friday", "Saturday", "Sunday"))

ggplot(myColTotal_dow2_1, aes(x=Variable_Name, y=Sum_of_Variable)) + 
  geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Variable Name") + ylab("Number of Times Option Selected")

Data Glimpse

Here is what the resulting working dataset looks like. We are left with 41659 records and 112 variables (we began with 42322 records and 135 variables).

glimpse(stores)
## Observations: 41,659
## Variables: 112
## $ StoreId                              <int> 60, 105, 419, 206, 206, 3...
## $ Status                               <fctr> Active, Active, Inactive...
## $ IsChurn                              <int> 0, 1, 0, 0, 1, 0, 0, 0, 0...
## $ IsChurn_2Months                      <int> 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ IsChurn_3Months                      <int> 0, 1, 0, 1, 1, 0, 0, 0, 0...
## $ IsChurn_6Months                      <int> 0, 1, 1, 1, 1, 1, 0, 0, 1...
## $ SMSEnabled                           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ StoreAgeMonths                       <int> 92, 95, 76, 75, 77, 75, 8...
## $ Amenities                            <int> 0, 2, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_ABCCheckIn               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_ABCDataTrak              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_dotFit                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_fitRewards               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_Listen360                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Integration_Perkville                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_ClubManagementSystem <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_RewardsProgram       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_Supplements_Wellness <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ IntegrationType_Surveys              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Forms                                <int> 2, 0, 0, 0, 0, 0, 1, 2, 0...
## $ RequiredForms                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NonRequiredForms                     <int> 2, 0, 0, 0, 0, 0, 1, 2, 0...
## $ NonRequiredWaiver                    <int> 1, 0, 0, 0, 0, 0, 1, 1, 0...
## $ NonRequiredHealthHistory             <int> 1, 0, 0, 0, 0, 0, 0, 1, 0...
## $ RequiredWaiver                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins                             <int> 1029, 0, 0, 0, 0, 0, 614,...
## $ Checkins_Sunday                      <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Monday                      <int> 225, 0, 0, 0, 0, 0, 156, ...
## $ Checkins_Tuesday                     <int> 232, 0, 0, 0, 0, 0, 125, ...
## $ Checkins_Wednesday                   <int> 156, 0, 0, 0, 0, 0, 119, ...
## $ Checkins_Thursday                    <int> 216, 0, 0, 0, 0, 0, 115, ...
## $ Checkins_Friday                      <int> 160, 0, 0, 0, 0, 0, 81, 7...
## $ Checkins_Saturday                    <int> 39, 0, 0, 0, 0, 0, 18, 6,...
## $ Checkins_Weekend                     <int> 40, 0, 0, 0, 0, 0, 18, 6,...
## $ Checkins_Weekday                     <int> 989, 0, 0, 0, 0, 0, 596, ...
## $ Checkins_Hour0to1                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour1to2                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour2to3                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour3to4                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour4to5                    <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour5to6                    <int> 56, 0, 0, 0, 0, 0, 20, 45...
## $ Checkins_Hour6to7                    <int> 25, 0, 0, 0, 0, 0, 43, 14...
## $ Checkins_Hour7to8                    <int> 82, 0, 0, 0, 0, 0, 52, 24...
## $ Checkins_Hour8to9                    <int> 74, 0, 0, 0, 0, 0, 38, 69...
## $ Checkins_Hour9to10                   <int> 237, 0, 0, 0, 0, 0, 29, 9...
## $ Checkins_Hour10to11                  <int> 115, 0, 0, 0, 0, 0, 35, 6...
## $ Checkins_Hour11to12                  <int> 69, 0, 0, 0, 0, 0, 41, 32...
## $ Checkins_Hour12to13                  <int> 32, 0, 0, 0, 0, 0, 42, 12...
## $ Checkins_Hour13to14                  <int> 26, 0, 0, 0, 0, 0, 40, 9,...
## $ Checkins_Hour14to15                  <int> 19, 0, 0, 0, 0, 0, 29, 17...
## $ Checkins_Hour15to16                  <int> 14, 0, 0, 0, 0, 0, 40, 31...
## $ Checkins_Hour16to17                  <int> 40, 0, 0, 0, 0, 0, 42, 41...
## $ Checkins_Hour17to18                  <int> 110, 0, 0, 0, 0, 0, 50, 3...
## $ Checkins_Hour18to19                  <int> 46, 0, 0, 0, 0, 0, 70, 38...
## $ Checkins_Hour19to20                  <int> 50, 0, 0, 0, 0, 0, 36, 22...
## $ Checkins_Hour20to21                  <int> 30, 0, 0, 0, 0, 0, 6, 0, ...
## $ Checkins_Hour21to22                  <int> 3, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour22to23                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour23to0                   <int> 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Morning                     <int> 658, 0, 0, 0, 0, 0, 258, ...
## $ Checkins_Afternoon                   <int> 241, 0, 0, 0, 0, 0, 243, ...
## $ Checkins_Evening                     <int> 129, 0, 0, 0, 0, 0, 112, ...
## $ Checkins_Night                       <int> 1, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Past3Months                 <int> 3160, 3543, 0, 0, 0, 0, 1...
## $ Checkins_Sunday_Past3Months          <int> 11, 163, 0, 0, 0, 0, 0, 0...
## $ Checkins_Monday_Past3Months          <int> 574, 684, 0, 0, 0, 0, 376...
## $ Checkins_Tuesday_Past3Months         <int> 672, 643, 0, 0, 0, 0, 390...
## $ Checkins_Wednesday_Past3Months       <int> 621, 758, 0, 0, 0, 0, 423...
## $ Checkins_Thursday_Past3Months        <int> 665, 548, 0, 0, 0, 0, 354...
## $ Checkins_Friday_Past3Months          <int> 484, 500, 0, 0, 0, 0, 228...
## $ Checkins_Saturday_Past3Months        <int> 133, 247, 0, 0, 0, 0, 40,...
## $ Checkins_Weekend_Past3Months         <int> 144, 410, 0, 0, 0, 0, 40,...
## $ Checkins_Weekday_Past3Months         <int> 3016, 3133, 0, 0, 0, 0, 1...
## $ Checkins_Hour0to1_Past3Months        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour1to2_Past3Months        <int> 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Hour2to3_Past3Months        <int> 0, 0, 0, 0, 0, 0, 4, 0, 0...
## $ Checkins_Hour3to4_Past3Months        <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour4to5_Past3Months        <int> 9, 12, 0, 0, 0, 0, 0, 0, ...
## $ Checkins_Hour5to6_Past3Months        <int> 127, 85, 0, 0, 0, 0, 94, ...
## $ Checkins_Hour6to7_Past3Months        <int> 78, 162, 0, 0, 0, 0, 182,...
## $ Checkins_Hour7to8_Past3Months        <int> 164, 198, 0, 0, 0, 0, 158...
## $ Checkins_Hour8to9_Past3Months        <int> 194, 275, 0, 0, 0, 0, 146...
## $ Checkins_Hour9to10_Past3Months       <int> 657, 248, 0, 0, 0, 0, 110...
## $ Checkins_Hour10to11_Past3Months      <int> 395, 248, 0, 0, 0, 0, 98,...
## $ Checkins_Hour11to12_Past3Months      <int> 236, 408, 0, 0, 0, 0, 112...
## $ Checkins_Hour12to13_Past3Months      <int> 177, 307, 0, 0, 0, 0, 87,...
## $ Checkins_Hour13to14_Past3Months      <int> 107, 202, 0, 0, 0, 0, 93,...
## $ Checkins_Hour14to15_Past3Months      <int> 83, 134, 0, 0, 0, 0, 80, ...
## $ Checkins_Hour15to16_Past3Months      <int> 84, 149, 0, 0, 0, 0, 106,...
## $ Checkins_Hour16to17_Past3Months      <int> 142, 257, 0, 0, 0, 0, 132...
## $ Checkins_Hour17to18_Past3Months      <int> 334, 237, 0, 0, 0, 0, 135...
## $ Checkins_Hour18to19_Past3Months      <int> 161, 238, 0, 0, 0, 0, 189...
## $ Checkins_Hour19to20_Past3Months      <int> 132, 160, 0, 0, 0, 0, 76,...
## $ Checkins_Hour20to21_Past3Months      <int> 73, 116, 0, 0, 0, 0, 7, 0...
## $ Checkins_Hour21to22_Past3Months      <int> 6, 102, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23_Past3Months      <int> 0, 4, 0, 0, 0, 0, 0, 0, 0...
## $ Checkins_Hour23to0_Past3Months       <int> 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ Checkins_Morning_Past3Months         <int> 1851, 1624, 0, 0, 0, 0, 9...
## $ Checkins_Afternoon_Past3Months       <int> 927, 1286, 0, 0, 0, 0, 63...
## $ Checkins_Evening_Past3Months         <int> 372, 620, 0, 0, 0, 0, 272...
## $ Checkins_Night_Past3Months           <int> 10, 13, 0, 0, 0, 0, 6, 0,...
## $ PurchasedSessions                    <int> 3414, 8, 484, 24, 0, 471,...
## $ PurchasedSessions_Past3Months        <int> 14802, 432, 1731, 24, 46,...
## $ TotalRevenue                         <dbl> 42708, 240, 19992, 2700, ...
## $ TotalRevenue_Past3Months             <dbl> 120245, 36470, 57193, 270...
## $ ActiveUsers                          <int> 2489, 4396, 0, 0, 0, 0, 7...
## $ ActiveUsers_Last3Months              <int> 2519, 4413, 0, 0, 0, 0, 7...
## $ ActiveEmployees                      <int> 13, 4, 0, 0, 0, 0, 5, 13,...
## $ ActiveEmployees_Last3Months          <int> 17, 16, 0, 0, 0, 0, 6, 17...
## $ Classes_Services                     <int> 12, 109, 0, 9, 9, 0, 22, ...
## $ Classes                              <int> 5, 66, 0, 1, 1, 0, 13, 5,...
## $ Services                             <int> 7, 43, 0, 8, 8, 0, 9, 7, ...

Classes & Services

Get Data

#Get Data
Classes_Services <- read_csv("C:/Users/cweaver/Downloads/Classes_Services.csv", 
                   col_types = cols(AvailablePIF = col_integer(), 
                      ByPerson = col_integer(), CanSeeInstructor = col_integer(), 
                      CancellationHrs = col_integer(), 
                      ClassMins = col_integer(), CustSelfBook = col_integer(), 
                      Disabled = col_integer(), EmailReminders = col_integer(), 
                      HalfHour = col_integer(), MultipleInstructors = col_integer(), 
                      MustHaveCredit = col_integer(), MustPreBook = col_integer(), 
                      NumPerClass = col_integer(), OnTheHour = col_integer(), 
                      QuarterPast = col_integer(), QuarterTill = col_integer(), 
                      RescheduleDeadline = col_integer(), 
                      RuleCustomers = col_integer(), RuleFrontDesk = col_integer(), 
                      SMSReminders = col_integer(), ServicesId = col_integer(), 
                      ShowPublic = col_integer(), StandardPrice = col_integer()), 
                   na = "NA")

myData <- Classes_Services
rm(Classes_Services)

Classes_Services.csv returned 34,055 records with 28 variables. Of the 28 variables, 1 is a character type and 27 are numeric.

glimpse(myData)
## Observations: 34,055
## Variables: 28
## $ Type                  <chr> "Class", "Class", "Class", "Class", "Cla...
## $ ClassId               <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ StoreId               <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ Disabled              <int> NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, NA,...
## $ MustPreBook           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, ...
## $ CustSelfBook          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, ...
## $ MustHaveCredit        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ClassMins             <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass           <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, NA,...
## $ StandardPrice         <int> 0, 15, NA, NA, NA, NA, NA, NA, 25, 10, 1...
## $ MultipleInstructors   <int> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ AvailablePIF          <int> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ EmailReminders        <int> 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ SMSReminders          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CanSeeInstructor      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic            <int> NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ OnTheHour             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ QuarterPast           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HalfHour              <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ QuarterTill           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RuleFrontDesk         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RuleCustomers         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ CancellationHrs       <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline    <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...

Remove NA

Just as before, replace the NAs with 0s:

myData <- myData %>%  mutate_if(is.integer, funs(replace(., is.na(.), 0)))
myData <- myData %>%  mutate_if(is.double, as.integer)
glimpse(myData)
## Observations: 34,055
## Variables: 28
## $ Type                  <chr> "Class", "Class", "Class", "Class", "Cla...
## $ ClassId               <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ StoreId               <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ Disabled              <int> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0...
## $ MustPreBook           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ CustSelfBook          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ MustHaveCredit        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ClassMins             <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass           <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, 0, ...
## $ StandardPrice         <int> 0, 15, 0, 0, 0, 0, 0, 0, 25, 10, 150, 0,...
## $ MultipleInstructors   <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ AvailablePIF          <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmailReminders        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SMSReminders          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CanSeeInstructor      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic            <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ OnTheHour             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterPast           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ HalfHour              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterTill           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleFrontDesk         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleCustomers         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CancellationHrs       <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline    <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...

That looks much better!

Character to Factors

Change the variable Type from a character to a factor.

myData <- myData %>% mutate_if(is.character, as.factor)
class(myData$Type)
## [1] "factor"

Explore Factors

charSummary(myData)
##          n miss miss% unique           top5levels:count
## Type 34055    0     0      2 Class:17465, Service:16590
myData_factor <- myData %>% select_if(is.factor)

for(i in 1:length(myData_factor)){
  print(names(myData_factor[i]))
  print(table(myData_factor[i]))
}
## [1] "Type"
## 
##   Class Service 
##   17465   16590
for(i in 1:length(myData_factor)){
  print(ggplot(myData_factor, aes_string(names(myData_factor[i]))) + geom_bar())
}

Numerical Data

myNumSum <- numSummary(myData)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missPCT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(missPCT))
head(myNumSum, 20)
##            Variable_Name     n nunique nzeros missPCT miss%
## 1                ClassId 34055   17466  16590       0     0
## 2             ServicesId 34055   11109  17465       0     0
## 3                StoreId 34055    3142      0       0     0
## 4               Disabled 34055       2  27661       0     0
## 5            MustPreBook 34055       2  22355       0     0
## 6           CustSelfBook 34055       2  17846       0     0
## 7         MustHaveCredit 34055       2  25805       0     0
## 8              ClassMins 34055      35   1385       0     0
## 9            NumPerClass 34055      75  22266       0     0
## 10         StandardPrice 34055     193  19681       0     0
## 11   MultipleInstructors 34055       2  26495       0     0
## 12          AvailablePIF 34055       2  24489       0     0
## 13        EmailReminders 34055       2  24428       0     0
## 14          SMSReminders 34055       2  32101       0     0
## 15      CanSeeInstructor 34055       2  23596       0     0
## 16            ShowPublic 34055       2  18127       0     0
## 17 CanDirectlyBookPublic 34055       2  33340       0     0
## 18         AllowWaitList 34055       2  31613       0     0
## 19              ByPerson 34055       2  17466       0     0
## 20             OnTheHour 34055       2  17561       0     0

Good, no missing data!

Variances

myVariance <- as.data.frame(apply(myData[,-c(1)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <-  myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)
## 
## Yes 
##  27

All the variables have a variance > 0.

if(table(myVariance$Variance2)[1] > 0){
  filter(myVariance, Variance2 == "No")
  VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
  myData <- myData %>% select(StoreId, unlist(VarNames))
}

Duplicate Records

If duplicates are found, they will be removed.

cat("The number of duplicated rows is", nrow(myData) - nrow(unique(myData)))
## The number of duplicated rows is 12
myData[duplicated(myData),]
## # A tibble: 12 x 27
##    StoreId ClassId ServicesId Disabled MustPreBook CustSelfBook
##      <int>   <int>      <int>    <int>       <int>        <int>
##  1    1507       0       6563        0           0            0
##  2    2785       0      18588        0           0            0
##  3    3167       0      18710        0           0            1
##  4    3167       0      18710        0           0            1
##  5    3950       0      20519        0           0            0
##  6    4096       0      20804        0           0            0
##  7    4097       0      20808        0           0            0
##  8    4098       0      20812        0           0            0
##  9    4099       0      20816        0           0            0
## 10    4010       0      23251        0           0            0
## 11    4127       0      24098        0           0            0
## 12    5897       0      24235        0           0            0
## # ... with 21 more variables: MustHaveCredit <int>, ClassMins <int>,
## #   NumPerClass <int>, StandardPrice <int>, MultipleInstructors <int>,
## #   AvailablePIF <int>, EmailReminders <int>, SMSReminders <int>,
## #   CanSeeInstructor <int>, ShowPublic <int>, CanDirectlyBookPublic <int>,
## #   AllowWaitList <int>, ByPerson <int>, OnTheHour <int>,
## #   QuarterPast <int>, HalfHour <int>, QuarterTill <int>,
## #   RuleFrontDesk <int>, RuleCustomers <int>, CancellationHrs <int>,
## #   RescheduleDeadline <int>
if(nrow(myData) - nrow(unique(myData)) > 0){
  head(myData[duplicated(myData),])
  myData <- myData[!duplicated(myData),]
}

Data Glimpse

glimpse(myData)
## Observations: 34,055
## Variables: 27
## $ StoreId               <int> 1, 1, 11, 17, 20, 20, 20, 20, 20, 24, 20...
## $ ClassId               <int> 2, 3, 33, 36, 37, 39, 41, 43, 44, 48, 51...
## $ ServicesId            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Disabled              <int> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0...
## $ MustPreBook           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ CustSelfBook          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1...
## $ MustHaveCredit        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ClassMins             <int> 45, 60, 45, 120, 60, 60, 60, 30, 30, 60,...
## $ NumPerClass           <int> 6, 10, 9, 200, 10, 12, 40, 2, 4, 24, 0, ...
## $ StandardPrice         <int> 0, 15, 0, 0, 0, 0, 0, 0, 25, 10, 150, 0,...
## $ MultipleInstructors   <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ AvailablePIF          <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmailReminders        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SMSReminders          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CanSeeInstructor      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ShowPublic            <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ CanDirectlyBookPublic <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ AllowWaitList         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ ByPerson              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ OnTheHour             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterPast           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ HalfHour              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ QuarterTill           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleFrontDesk         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RuleCustomers         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CancellationHrs       <int> 12, 12, 24, 48, 24, 24, 24, 24, 24, 24, ...
## $ RescheduleDeadline    <int> 12, 12, 24, 48, 12, 12, 12, 12, 12, 24, ...

Users

Get Data

Users <- read_csv("C:/Users/cweaver/Downloads/Users.csv", col_types = cols(StoreId = col_integer()), progress = FALSE)
myData <- Users
rm(Users)

Users.csv returned 17,516,305 records with 48 variables. Of the 48 variables, 2 are character types and 46 are numeric.

glimpse(myData)
## Observations: 17,516,305
## Variables: 48
## $ UserId                     <int> 1, 2, 3, 7, 8, 1059, 1169, 1172, 11...
## $ StoreId                    <int> NA, NA, 0, 0, NA, 1, 1, 1, 1, 1, 1,...
## $ CheckInCredentialsModified <int> 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Gender                     <chr> "NULL", "NULL", "M", "M", "NULL", "...
## $ UserType                   <chr> "CrStaff", "CrStaff", "CrStaff", "D...
## $ Amenities                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tags                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins                   <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 9, 3, 1,...
## $ Checkins_Sunday            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Monday            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1,...
## $ Checkins_Tuesday           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Wednesday         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Thursday          <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 0, 0,...
## $ Checkins_Friday            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Saturday          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour0to1          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour1to2          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour2to3          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour3to4          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour4to5          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour5to6          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour6to7          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour7to8          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour8to9          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Hour9to10         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour10to11        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,...
## $ Checkins_Hour11to12        <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 0, 0,...
## $ Checkins_Hour12to13        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Checkins_Hour13to14        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Hour14to15        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour15to16        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour16to17        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour17to18        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour18to19        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ Checkins_Hour19to20        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ Checkins_Hour20to21        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour21to22        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour23to0         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PurchasedSessions          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TotalSpent                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Forms                      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredForms              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredForms           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredWaiver          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredHealthHistory   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredWaiver             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredHealthHistory      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Remove NA

There do not appear to be as many NAs as in the previous files; this time they are concentrated in the StoreId variable. Also note the literal "NULL" strings in the Gender field.

myData <- myData %>%  mutate_if(is.integer, funs(replace(., is.na(.), 0))) #changes int to dbl
myData[, -41] <- myData[, -41] %>% mutate_if(is.double, as.integer) #return to int, keeping TotalSpent (column 41) as dbl
glimpse(myData$StoreId)
##  int [1:17516305] 0 0 0 0 0 1 1 1 1 1 ...

StoreId looks better now.

Character to Factors

Gender and UserType are character variables. Change them to factors.

myData <- myData %>% mutate_if(is.character, as.factor)
charSummary(myData)
##                 n miss    miss% unique
## Gender   17516304    1 5.71e-06     20
## UserType 17516305    0 0.00e+00     13
##                                                                                                            top5levels:count
## Gender                                                     F:6181600, M:5638707, NULL:2834500, Female:1527379, Male:1321975
## UserType ClubClient:15676776, DeletedClubClient:1645260, ClubClientTemporary:63819, DeletedClubAdmin:62840, ClubAdmin:55389
myData_factor <- myData %>% select_if(is.factor)

Note Gender has many NULL values (plus one true NA). These will be managed later.

Tables for the Factors

Examine the factors in the working dataset.

for(i in 1:length(myData_factor)){
  print(names(myData_factor[i]))
  print(table(myData_factor[i]))
}
## [1] "Gender"
## 
##         "         1     Erkek         f         F    female    Female 
##         1         6         2      2326   6181600      3700   1527379 
##         g     Kadin         m         M      male      Male Masculino 
##         1         1      2349   5638707      2734   1321975         1 
##         n      NULL         p         u         U 
##       133   2834500         1        39       849 
## [1] "UserType"
## 
##                 CAS           ClubAdmin          ClubClient 
##                   5               55389            15676776 
## ClubClientTemporary         ClubTrainer           CorpAdmin 
##               63819                 823                3256 
##             CrStaff    DeletedClubAdmin   DeletedClubClient 
##                 165               62840             1645260 
##  DeletedClubTrainer    DeletedCorpAdmin      DeletedCrStaff 
##                2776                2263                 164 
##      Duplicate User 
##                2769
rm(myData_factor)

There is much work to do on the Gender variable. We will also choose the appropriate UserType values for modeling.

User Factor - Gender

The values in Gender are varied. These will need to be collapsed into a couple of factor levels.

myData %>% group_by(Gender) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))
## # A tibble: 20 x 2
##       Gender Unique_Values
##       <fctr>         <int>
##  1         F       6181600
##  2         M       5638707
##  3      NULL       2834500
##  4    Female       1527379
##  5      Male       1321975
##  6    female          3700
##  7      male          2734
##  8         m          2349
##  9         f          2326
## 10         U           849
## 11         n           133
## 12         u            39
## 13         1             6
## 14     Erkek             2
## 15      "\""             1
## 16         g             1
## 17     Kadin             1
## 18 Masculino             1
## 19         p             1
## 20      <NA>             1
ggplot(myData, aes(fct_infreq(Gender))) + geom_bar() + xlab("ClubReady Subset - Gender") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

levels(myData$Gender)[levels(myData$Gender) == "F"] <- "Female"
levels(myData$Gender)[levels(myData$Gender) == "female"] <- "Female"
levels(myData$Gender)[levels(myData$Gender) == "f"] <- "Female"

levels(myData$Gender)[levels(myData$Gender) == "M"] <- "Male"
levels(myData$Gender)[levels(myData$Gender) == "male"] <- "Male"
levels(myData$Gender)[levels(myData$Gender) == "m"] <- "Male"
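The level-by-level renames above work, but forcats (already loaded at the top of this document) can express the same collapse in a single fct_collapse() call. A minimal sketch on a toy vector with illustrative values:

```r
library(forcats)

# Toy Gender vector mirroring the raw spellings seen above (illustrative values)
g <- factor(c("F", "female", "f", "M", "male", "m", "NULL"))

# Collapse the case/spelling variants into two canonical levels in one call
g <- fct_collapse(g,
                  Female = c("F", "female", "f"),
                  Male   = c("M", "male", "m"))

table(g)
```

Either approach yields the same two canonical levels; fct_collapse simply keeps the mapping in one place.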

myData %>% group_by(Gender) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))
## # A tibble: 14 x 2
##       Gender Unique_Values
##       <fctr>         <int>
##  1    Female       7715005
##  2      Male       6965765
##  3      NULL       2834500
##  4         U           849
##  5         n           133
##  6         u            39
##  7         1             6
##  8     Erkek             2
##  9      "\""             1
## 10         g             1
## 11     Kadin             1
## 12 Masculino             1
## 13         p             1
## 14      <NA>             1

This looks better, but there are still suspect values. We will remove:

  • All Gender values with small counts - everything from U down in the table above.
  • The NULL values. It is reasonable to assume Gender is an informative variable in churn modeling, so Valorem will initially remove the NULL values rather than impute them.
myData <- filter(myData, Gender == "Female" | Gender == "Male")
myData$Gender <- factor(myData$Gender)

ggplot(myData, aes(fct_infreq(Gender))) + geom_bar() + xlab("ClubReady Subset - Gender") +
  scale_y_continuous(labels = scales::comma)

User Factor - UserType

myData %>% group_by(UserType) %>% summarize(Unique_Values = n()) %>% arrange(desc(Unique_Values))
## # A tibble: 13 x 2
##               UserType Unique_Values
##                 <fctr>         <int>
##  1          ClubClient      13111028
##  2   DeletedClubClient       1425029
##  3    DeletedClubAdmin         56073
##  4           ClubAdmin         44420
##  5 ClubClientTemporary         32757
##  6           CorpAdmin          3221
##  7  DeletedClubTrainer          2773
##  8    DeletedCorpAdmin          2250
##  9      Duplicate User          2210
## 10         ClubTrainer           819
## 11             CrStaff           106
## 12      DeletedCrStaff            79
## 13                 CAS             5
ggplot(myData, aes(fct_infreq(UserType))) + geom_bar() + xlab("ClubReady Subset - UserType") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(labels = scales::comma)

ClubReady confirmed that Valorem should use:

  • ClubClient
  • DeletedClubClient
  • ClubClientTemporary
myData <- filter(myData, UserType == 'ClubClient' | UserType == 'DeletedClubClient' | UserType == 'ClubClientTemporary')
myData$UserType <- factor(myData$UserType)

ggplot(myData, aes(fct_infreq(UserType))) + geom_bar() + xlab("ClubReady Subset - UserType") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(labels = scales::comma)

Numerical Data

Because of the large number of User records, a 30% random sample is used in the code below.

myNumSum <- numSummary(sample_frac(myData, .3))[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missPCT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(missPCT))
head(myNumSum, 20)
##                 Variable_Name       n nunique  nzeros missPCT miss%
## 1                      UserId 4370644 4370644       0       0     0
## 2                     StoreId 4370644    3528    2532       0     0
## 3  CheckInCredentialsModified 4370644       2  480413       0     0
## 4                   Amenities 4370644      41 4221592       0     0
## 5                        Tags 4370644       7 4349532       0     0
## 6                    Checkins 4370644    1963 3218561       0     0
## 7             Checkins_Sunday 4370644     386 3922237       0     0
## 8             Checkins_Monday 4370644     495 3497281       0     0
## 9            Checkins_Tuesday 4370644     495 3499396       0     0
## 10         Checkins_Wednesday 4370644     490 3501510       0     0
## 11          Checkins_Thursday 4370644     463 3531767       0     0
## 12            Checkins_Friday 4370644     449 3614680       0     0
## 13          Checkins_Saturday 4370644     412 3756634       0     0
## 14          Checkins_Hour0to1 4370644     523 4335798       0     0
## 15          Checkins_Hour1to2 4370644     131 4358393       0     0
## 16          Checkins_Hour2to3 4370644     135 4361939       0     0
## 17          Checkins_Hour3to4 4370644     219 4360650       0     0
## 18          Checkins_Hour4to5 4370644     509 4334624       0     0
## 19          Checkins_Hour5to6 4370644     772 4235945       0     0
## 20          Checkins_Hour6to7 4370644     652 4185863       0     0

There is no missing data in the working dataset.

Variance

#Do not include UserId, StoreId, Gender, UserType, TotalSpent
myVariance <- as.data.frame(apply(myData[,-c(1,2,4,5,41)], 2, var))
myVariance <- tibble::rownames_to_column(myVariance)
names(myVariance)[2] <- "Variance"
myVariance <-  myVariance %>% mutate(Variance2 = ifelse(Variance == 0, "No", "Yes"))
table(myVariance$Variance2)
## 
##  No Yes 
##   1  42

Because 1 variable has no variance - all its values are the same - it can be removed from the working dataset. If there are no differences in a column, it is of no use in the development of an algorithm. The variable to be removed because it has no variance is:

if(table(myVariance$Variance2)[1] > 0){
  VarNames <- myVariance %>% filter(Variance > 0) %>% select(rowname)
  zeroVarNames <- myVariance %>% filter(Variance == 0) %>% select(rowname)
  myData <- myData %>% select(UserId, StoreId, Gender, TotalSpent,  unlist(VarNames))
  zeroVarNames
}
##          rowname
## 1 RequiredWaiver

Outlier Detection

In the working dataset, there is one variable, TotalSpent, that should be evaluated for potential outliers. There are many ways to visualize outliers. While boxplots are the most commonly used visualization, the number of records here is large enough that plotting is not an optimal reporting option - it takes a long time to plot millions of records.

Comparing the opposite ends of TotalSpent produces interesting information:

tmpRevDesc <- arrange(myData, desc(TotalSpent)) %>% select(TotalSpent)
tmpRevDesc <- tmpRevDesc[1:25,]
tmpRevDesc <- as.data.frame(scales::dollar(tmpRevDesc$TotalSpent))
names(tmpRevDesc) <- "Total_Spent"

tmpRevAsc <- arrange(myData, TotalSpent) %>% select(TotalSpent)
tmpRevAsc <- tmpRevAsc[1:25,]
tmpRevAsc <- as.data.frame(scales::dollar(tmpRevAsc$TotalSpent))
names(tmpRevAsc) <- "Total_Spent"

knitr::kable(list(tmpRevDesc, tmpRevAsc))
|Total_Spent (highest 25) |Total_Spent (lowest 25) |
|:------------------------|:-----------------------|
|$10,004,136              |$-638,972               |
|$3,204,780               |$-293,628               |
|$2,759,211               |$-42,475                |
|$2,680,537               |$-27,678                |
|$2,640,971               |$-18,986                |
|$2,401,029               |$-16,860                |
|$2,382,642               |$-16,254                |
|$2,375,716               |$-15,842                |
|$2,358,190               |$-14,992                |
|$2,260,660               |$-14,991                |
|$2,164,586               |$-14,542                |
|$2,128,461               |$-12,195                |
|$2,092,029               |$-10,793                |
|$2,073,144               |$-8,495                 |
|$2,041,499               |$-7,579                 |
|$2,019,114               |$-7,496                 |
|$2,004,718               |$-6,207                 |
|$1,996,531               |$-5,797                 |
|$1,991,202               |$-4,245                 |
|$1,981,019               |$-4,098                 |
|$1,953,575               |$-4,098                 |
|$1,929,700               |$-4,098                 |
|$1,891,658               |$-3,995                 |
|$1,799,289               |$-3,798                 |
|$1,787,595               |$-3,398                 |

The highest TotalSpent value is $10,004,136 and the lowest value is $-638,972.

It appears the extreme negative values may be associated with account credits. This may be a data quality issue.
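Since plotting millions of points is slow, a numeric screen is a cheap alternative for flagging candidates for manual review. This sketch applies Tukey's 1.5 x IQR rule to toy data (the toy values and the choice of fence are illustrative assumptions, not part of the engagement's workflow):

```r
# Toy spend vector: typical values plus the two extremes reported above
spend <- c(rnorm(1000, mean = 100, sd = 25), 10004136, -638972)

# Tukey fences: flag anything more than 1.5 * IQR outside the quartiles
q     <- quantile(spend, probs = c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- spend[spend < q[1] - fence | spend > q[2] + fence]

length(outliers)  # count of records that warrant manual review
```

On the real data, the same screen runs in seconds and the flagged rows can be joined back to StoreId/UserId for investigation.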

Duplication

cat("The number of duplicated rows is", nrow(myData) - nrow(unique(myData)))
## The number of duplicated rows is 0
if(nrow(myData) - nrow(unique(myData)) > 0){
  head(myData[duplicated(myData),])
  myData <- myData[!duplicated(myData),]
}

Good news - no duplicate records.

Data Glimpse

Here is what the resulting working dataset looks like. We are left with 14,568,814 records and 46 variables.

glimpse(myData)
## Observations: 14,568,814
## Variables: 46
## $ UserId                     <int> 1169, 1172, 1178, 1187, 1188, 1191,...
## $ StoreId                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Gender                     <fctr> Male, Male, Female, Male, Male, Ma...
## $ TotalSpent                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CheckInCredentialsModified <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Amenities                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tags                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins                   <int> 1, 0, 0, 9, 3, 1, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Sunday            <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Monday            <int> 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Tuesday           <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Wednesday         <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Thursday          <int> 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Friday            <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Saturday          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour0to1          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour1to2          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour2to3          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour3to4          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour4to5          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour5to6          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour6to7          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour7to8          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour8to9          <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour9to10         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour10to11        <int> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour11to12        <int> 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour12to13        <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour13to14        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour14to15        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour15to16        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour16to17        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour17to18        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour18to19        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour19to20        <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ Checkins_Hour20to21        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour21to22        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour22to23        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Checkins_Hour23to0         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PurchasedSessions          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Forms                      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredForms              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredForms           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredWaiver          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NonRequiredHealthHistory   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ RequiredHealthHistory      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Initial Data Modeling

This section walks through the Initial Data Modeling phase to identify candidate algorithms to predict different churn models for ClubReady. The section is organized as follows:

  • One of the two CSV data files for modeling, stores.csv, is explored in detail in the first section
  • In the following sections, the other data file (Users) is explored, albeit with less explanatory content
  • All of these sections roughly follow the same thought process:
    • Develop modeling data structure
    • Create modeling data file and import to Azure Machine Learning Studio
    • Perform Test/Training Data Split
    • Determine “Optimal” Initial Model

All of the SQL used to develop these datasets can be found in the [ML].[ChurnScripts] stored procedure within the [Reports] database.

Stores

Develop Modeling Data Structure

What is Churn?

During the On-Site Workshop, ClubReady provided a basic definition of Store Churn: if a Store experiences at least an 80% drop in Revenue over the course of a single month, then the Store is considered to have Churned.

We then need to answer What is Revenue? ClubReady provided the definition that Revenue is the total dollars invoiced to the Store by ClubReady. This information is available in the [PurchaseLog] table within the [ClubReady] database. Within this table, we utilize [StoreId], [PaymentMade] and [PurchaseAmount] columns to determine when the store has churned.

A single store cannot churn more than once within our time frame - a two-year time frame starting in November 2015 and ending in October 2017.
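As an illustration, this churn rule can be sketched in R with dplyr. The data below is hypothetical; the production logic lives in the [ML].[ChurnScripts] stored procedure.

```r
library(dplyr)

# Hypothetical monthly revenue per store (not actual ClubReady data)
revenue <- tibble::tibble(
    StoreId = c(1, 1, 1, 2, 2, 2),
    Month   = rep(seq(as.Date("2017-01-01"), by = "month", length.out = 3), 2),
    Revenue = c(1000, 950, 100, 500, 480, 470)
)

# A store is flagged as Churned in the first month whose revenue is at
# least 80% below the previous month's revenue
churn_flags <- revenue %>%
    group_by(StoreId) %>%
    arrange(Month, .by_group = TRUE) %>%
    mutate(IsChurn = !is.na(lag(Revenue)) & Revenue <= 0.2 * lag(Revenue)) %>%
    ungroup()
```

Here only Store 1 is flagged: its March revenue drops from 950 to 100, a decline of roughly 89%.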

Dataset Granularity

The next important consideration is the granularity of the modeling dataset. Granularity defines what each row in the dataset represents. In general, the granularity of the modeling dataset should match the data as it will be used in production. As discussed during the On-Site Workshop, the intention is to receive some type of notification that a store is likely to churn within the next few months. Therefore, a dataset is required in which each row describes a single store for a single month, along with whether that store churned within that month or within the next few months. This allows ClubReady to run a monthly reporting or notification system that alerts them to which stores are at risk of churning.

Create Modeling Data File

Using the information from the On-Site Workshop combined with what we learned in the Exploratory Data Analysis phase, the modeling dataset is created. The following subject areas within the ClubReady database were identified:

  • Store Info (Status, Age)
  • SMS
  • Amenities
  • Integrations
  • Forms
  • Checkins
  • Purchased Sessions
  • Revenue
  • Active Users
  • Active Employees
  • Classes
  • Services

The goal is to create a monthly “snapshot” of each store showing information about that store at that point in time. However, not all of these subject areas can be queried historically. This means there is no option but to take the data as it is today and assume that it has not changed over time.

The following subject areas do not track historic information:

  • Store info (Status)
  • SMS
  • Amenities
  • Integrations
  • Forms
  • Classes
  • Services

This leaves the following subject areas with historic information:

  • Store Info (Age)
  • Checkins
  • Purchased Sessions
  • Revenue
  • Active Users
  • Active Employees

To empower the models using time intelligence, metrics for these subject areas over the past three months were included. For instance, there are variables for [TotalRevenue_Past3Months] and [ActiveMembers_Past3Months].
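A sketch of how such rolling metrics can be derived from monthly snapshots, using hypothetical data and one plausible definition of “past three months” (the current month plus the two prior):

```r
library(dplyr)

# Hypothetical monthly snapshots for a single store
snapshots <- tibble::tibble(
    StoreId      = rep(1, 5),
    Month        = seq(as.Date("2017-01-01"), by = "month", length.out = 5),
    TotalRevenue = c(100, 200, 300, 400, 500)
)

# Rolling three-month total, analogous to [TotalRevenue_Past3Months]
snapshots <- snapshots %>%
    group_by(StoreId) %>%
    arrange(Month, .by_group = TRUE) %>%
    mutate(TotalRevenue_Past3Months = TotalRevenue +
               lag(TotalRevenue, 1, default = 0) +
               lag(TotalRevenue, 2, default = 0)) %>%
    ungroup()
```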

This was all combined into a single dataset. The queries to create the Stores.csv file can be found in the [ML].[ClubChurn] and [ML].[CompleteStoreData] stored procedures in the [Reports] database.

Perform Test/Training Data Split

Similar to the “Modeling Data Structure” phase, the training and testing data should represent a production use case as closely as possible. ClubReady wants to be able to identify whether a store is likely to churn in the next few months. In this instance, three months was selected as the time interval. Since the last month in the dataset is October 2017, the testing month should be August 2017. The records for August 2017 contain information about the month of August, as well as the previous two months for the Past3Months variables. These records also contain a variable [IsChurn_3Months] describing whether the store will churn in the next three months. This is what the model attempts to predict.

In machine learning, it is important that the model be tested using data that was not used to train it. For instance, if a store churns in August 2017, that churn is reflected in the [IsChurn_3Months] variable for the months of June and July as well, so those months must be excluded from the training set to prevent the test outcome from leaking into training.

The following training/testing split was developed:

  • Training Set 1
    • Start Month: January 2016
    • End Month: May 2017
  • Testing Set 1
    • Start Month: August 2017
    • End Month: August 2017
  • Variable to Predict: IsChurn_3Months

For additional testing, other splits were created:

  • Training Set 2
    • Start Month: January 2016
    • End Month: April 2017
  • Testing Set 2
    • Start Month: July 2017
    • End Month: July 2017
  • Training Set 3
    • Start Month: January 2016
    • End Month: March 2017
  • Testing Set 3
    • Start Month: June 2017
    • End Month: June 2017
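These rolling splits can be generated programmatically. A minimal sketch, assuming a snapshot table with a [ReferenceMonthStartDate] column; the helper make_split is hypothetical:

```r
# Build a training/testing split for a given test month. The training set
# ends three months before the test month, mirroring the splits above.
make_split <- function(data, test_month, train_start = as.Date("2016-01-01")) {
    train_end <- seq(test_month, by = "-1 month", length.out = 4)[4]
    list(
        train = data[data$ReferenceMonthStartDate >= train_start &
                     data$ReferenceMonthStartDate <= train_end, ],
        test  = data[data$ReferenceMonthStartDate == test_month, ]
    )
}

# e.g. Testing Set 1: make_split(snapshots, as.Date("2017-08-01"))
# pairs an August 2017 test month with training through May 2017
```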

Determine “Optimal” Initial Model

The Initial Modeling phase has two goals.

  1. Determine how much of a “pattern” is easily accessible within the data. For instance, if the best model created in this phase falls substantially below what we would consider “reasonable”, then it may be decided that more or different data is required.

  2. If patterns exist, then use the information to point towards the optimal way to improve the model using feature engineering or additional model optimization.

Selecting an Evaluation Metric

The selection of an evaluation metric is one of the most crucial aspects of the data modelling process. Recognizing that we are in a Binary Classification situation, we are limited to two main contenders, Accuracy/AUC and Precision/Recall. In general, Accuracy/AUC are acceptable when the two classes, “Churn” and “Not Churn” in this case, are approximately equally likely. However, in cases where one class is substantially more likely than another, Precision/Recall is the standard.

finalstoredata <- read_csv("C:/Users/cweaver/Downloads/Stores_2017_12_11.csv", col_types = cols())

churn <- finalstoredata["IsChurn_3Months"] %>% 
    group_by(IsChurn_3Months) %>%
    summarise(count=n()) %>% 
    mutate(perc=count/sum(count))

ggplot(churn, aes(x = factor(IsChurn_3Months), y = perc*100)) + 
    geom_bar(stat = "identity") + 
    ggtitle("IsChurn_3Months") + 
    xlab("IsChurn_3Months") + 
    ylab("Percentage") + 
    geom_text(aes(label = paste(round(perc*100, 0), "%")), nudge_y = 4)

Since the [IsChurn_3Months] variable is heavily imbalanced, Precision/Recall will be used as the performance metric. Precision is the percentage of correct Churn predictions out of all Churn predictions. Conversely, Recall is the percentage of correct Churn predictions out of all actual Churn records. Recall tells how many of the actual Churns are being predicted, while Precision shows how many of the predictions are actually Churns. These metrics tell different stories, and both are important; maximizing both is ideal. Therefore, an evaluation metric of Precision * Recall is used.
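A small sketch of these metrics computed from 0/1 actual and predicted labels (the function name prec_recall is illustrative; in practice Azure Machine Learning Studio reports Precision and Recall directly):

```r
# Precision, Recall, and their product from 0/1 actual and predicted labels
prec_recall <- function(actual, predicted) {
    tp <- sum(predicted == 1 & actual == 1)   # correct Churn predictions
    precision <- tp / sum(predicted == 1)     # out of all Churn predictions
    recall    <- tp / sum(actual == 1)        # out of all actual Churn records
    c(precision = precision, recall = recall, metric = precision * recall)
}

prec_recall(actual = c(1, 1, 1, 0, 0, 0), predicted = c(1, 0, 0, 1, 0, 0))
```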

The “Kitchen Sink” Approach

Azure Machine Learning Studio contains fifteen distinct Binary Classification algorithms. Fourteen of these can utilize a module called “Tune Model Hyper-parameters”. This module will allow us to train and test multiple machine learning algorithms using different sets of hyper-parameters. Therefore, using all fourteen modules, it is possible to get insights quickly and easily. The process looks like this:

The full results of this process can be found in the Store Churn Data Modelling Results.xlsx file. Here are the top three models for predicting Store Churn:

Model Family Hyper-parameters Precision Recall
Boosted Decision Tree Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396 ~ Trees: 51 89.3% 38.4%
Boosted Decision Tree Leaves: 5 ~ Minimum Instances: 37 ~ Learning Rate: 0.030064 ~ Trees: 362 100.0% 30.8%
Boosted Decision Tree Leaves: 6 ~ Minimum Instances: 15 ~ Learning Rate: 0.356604 ~ Trees: 482 95.2% 30.8%

These results are promising. There is evidence of a pattern within the data that may be used to predict when a Store is likely to churn.

Users

Develop Modeling Data Structure

What is Churn?

During the On-Site Workshop, ClubReady provided a definition of User Churn. If a User’s contract expires and is not renewed by the end of the next month, then that User has churned. This information is available in the [ContractPurchases] table within the [ClubReady] database. Within this table, [UserId], [ActivationDateUTC], [AgreedDate] and [Cancelled] columns are used.

Similar to Store Churn, a single user cannot churn more than once within the time frame.

Dataset Granularity

The granularity of the User Churn data is per user, per month, for every month that the user has an active contract, plus an additional record for the month after their contract ends. This is the “Churn” month.

Create Modeling Data File

Using the information from the On-Site Workshop combined with what was learned in the Exploratory Data Analysis phase, a User modeling dataset can be developed. The following subject areas within the ClubReady database were identified:

  • Store Churn Data
  • User Info (Gender, Type, CheckinCredentialsModified)
  • Amenities
  • Tags
  • Checkins
  • Forms
  • Purchased Sessions
  • Spend

The goal is to create a monthly “snapshot” of each user showing information about that user at that point in time. However, not all of these subject areas can be queried historically.

The following subject areas do not track historic information:

  • Store Churn Data (see previous section)
  • User Info (Gender, Type, CheckinCredentialsModified)
  • Amenities
  • Tags
  • Forms

This leaves the following subject areas with historic information:

  • Checkins
  • Purchased Sessions
  • Spend

To empower the models using time intelligence, additional variables for [TotalSpend_Past3Months] and [Checkins_Past3Months] were created.

All of this information was combined into a single dataset. The queries to create the User.csv file can be found in the [ML].[MemberChurn] and [ML].[CompleteUserData] stored procedures in the [Reports] database.

Perform Test/Training Data Split

The logic for the test/training split is identical to that of “Store Churn”.

  • Training Set 1
    • Start Month: January 2016
    • End Month: May 2017
  • Testing Set 1
    • Start Month: August 2017
    • End Month: August 2017
  • Variable to Predict: IsChurn_3Months

For additional testing, we can also move backwards any number of months to create new training/testing splits as follows:

  • Training Set 2
    • Start Month: January 2016
    • End Month: April 2017
  • Testing Set 2
    • Start Month: July 2017
    • End Month: July 2017
  • Training Set 3
    • Start Month: January 2016
    • End Month: March 2017
  • Testing Set 3
    • Start Month: June 2017
    • End Month: June 2017

Determine “Optimal” Initial Model

Selecting an Evaluation Metric

A major difference between the Store and User data is the overall size of the data. The [FinalUserData] table is 24GB as a CSV. Randomly selected sets of 1M, 5M, 10M and 20M rows were created.
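The subsampling step can be sketched with data.table, demonstrated here on a small synthetic table; against the real 24GB extract the sample size would be 1e6 and fread()/fwrite() would read and write the CSVs:

```r
library(data.table)

# Synthetic stand-in for the full user table
set.seed(42)
full <- data.table(UserId = 1:100, IsChurn_3Months = rbinom(100, 1, 0.1))

# Draw a random subset of rows without replacement
n_sample  <- 10                      # stands in for 1e6
subset_1m <- full[sample(.N, n_sample)]
```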

finaluserdata <- read_csv("C:/Users/cweaver/Downloads/Users_1M_2017_12_12.csv", col_types = cols(
    Store_UserEmployeeRatio = col_double(),
    Store_UserEmployeeRatio_Last3Months = col_double(),
    Store_AverageMinsPerClass_Service = col_double(),
    Store_AveragePricePerClass_Service = col_double(),
    Store_AverageCancellationHrsPerClass_Service = col_double(),
    Store_AverageRescheduleDeadlinePerClass_Service = col_double(),
    Store_AverageMinsPerClass = col_double(),
    Store_AveragePricePerClass = col_double(),
    Store_AverageCancellationHrsPerClass = col_double(),
    Store_AverageRescheduleDeadlinePerClass = col_double(),
    Store_AverageMinsPerService = col_double(),
    Store_AveragePricePerService = col_double(),
    Store_AverageCancellationHrsPerService = col_double(),
    Store_AverageRescheduleDeadlinePerService = col_double())
    ,progress = FALSE
)

churn <- finaluserdata["IsChurn_3Months"] %>% 
    group_by(IsChurn_3Months) %>%
    summarise(count=n()) %>% 
    mutate(perc=count/sum(count))

ggplot(churn, aes(x = factor(IsChurn_3Months), y = perc*100)) + 
    geom_bar(stat = "identity") + 
    ggtitle("IsChurn_3Months") + 
    xlab("IsChurn_3Months") + 
    ylab("Percentage") + 
    geom_text(aes(label = paste(round(perc*100, 0), "%")), nudge_y = 4)

The “Kitchen Sink” Approach

The full results of this process can be found in the User Churn Data Modelling Results.xlsx file. Here are the top three models for predicting User Churn:

Model Family Hyper-parameters Precision Recall
Boosted Decision Tree Leaves: 32 ~ Minimum Instances: 6 ~ Learning Rate: 0.252098531 ~ Trees: 270 57.1% 16.5%
Boosted Decision Tree Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396247 ~ Trees: 51 55.0% 15.9%
Boosted Decision Tree Leaves: 17 ~ Minimum Instances: 13 ~ Learning Rate: 0.06286619 ~ Trees: 50 54.5% 12.6%

The Precision and Recall of these models are quite low. Other methods are required to develop a feasible model - if the data allows an algorithmic solution at all.

Further Investigation

The datasets used to develop the initial data models contained 220 predictors and either 1, 5 or 10 million records. Each dataset produced similar results, so it is worth considering ways to reduce the number of predictors in the model. It is possible that 220 predictors, many of them dominated by zeros, are simply overwhelming the modeling algorithms and leading to poor models. Reducing the number of predictors without losing much information may therefore improve algorithm performance. A common technique for this is Principal Components Analysis (PCA). Before PCA can be performed, the dataset must be cleaned by removing non-predictive columns.

ignore.set <- c("UserId", "StoreId", "ReferenceMonthStartDate","IsChurn", "IsChurn_2Months", "IsChurn_3Months", "IsChurn_6Months")
disp <- data.frame(ignore.set)
names(disp) <- "Variables Ignored for Business Reasons"
disp
##   Variables Ignored for Business Reasons
## 1                                 UserId
## 2                                StoreId
## 3                ReferenceMonthStartDate
## 4                                IsChurn
## 5                        IsChurn_2Months
## 6                        IsChurn_3Months
## 7                        IsChurn_6Months

Next, remove any variables with missing values, as PCA does not allow for these.

finaluserdata.miss <- finaluserdata[,-which(names(finaluserdata) %in% ignore.set)]
myNumSum <- numSummary(finaluserdata.miss)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
ignore.miss <- myNumSum[myNumSum[,5] > 0,1]
disp <- data.frame(ignore.miss)
names(disp) <- c("Variables with Missing Values")
disp
##                      Variables with Missing Values
## 1              Store_UserEmployeeRatio_Last3Months
## 2                          Store_UserEmployeeRatio
## 3                Store_AverageMinsPerClass_Service
## 4               Store_AveragePricePerClass_Service
## 5     Store_AverageCancellationHrsPerClass_Service
## 6  Store_AverageRescheduleDeadlinePerClass_Service
## 7           Store_AverageCancellationHrsPerService
## 8        Store_AverageRescheduleDeadlinePerService
## 9                     Store_AveragePricePerService
## 10                     Store_AverageMinsPerService
## 11                       Store_AverageMinsPerClass
## 12                      Store_AveragePricePerClass
## 13            Store_AverageCancellationHrsPerClass
## 14         Store_AverageRescheduleDeadlinePerClass

Remove variables that have no variance, as these are not useful for predictive modeling.

finaluserdata.var <- finaluserdata.miss[,-which(names(finaluserdata.miss) %in% ignore.miss)]
ignore.var <- names(which(round(apply(finaluserdata.var, MARGIN = 2, FUN = var), 4)<= 0))
disp <- data.frame(ignore.var)
names(disp) <- "Variables with No Variance"
disp
##                      Variables with No Variance
## 1                              NonRequiredForms
## 2                             NonRequiredWaiver
## 3                      NonRequiredHealthHistory
## 4                         RequiredHealthHistory
## 5                              Store_SMSEnabled
## 6                      Store_Integration_FitBPO
## 7                  Store_Integration_MotionSoft
## 8                   Store_Integration_Paramount
## 9                Store_Integration_ShareYourFit
## 10       Store_Integration_VisualFitnessPlanner
## 11 Store_IntegrationType_LandingPages_Marketing
## 12            Store_IntegrationType_SocialMedia

Perform One-Hot Encoding to convert the remaining categorical variables into numeric ones, a PCA requirement.

The output from this code is excessive, so the results are not shown.

finaluserdata.enc <- finaluserdata.var[,-which(names(finaluserdata.var) %in% ignore.var)]
ignore.char <- row.names(charSummary(finaluserdata.enc))
encoder <- onehot(finaluserdata.enc[,which(names(finaluserdata.enc) %in% ignore.char)], stringsAsFactors = TRUE)
dat.encoded <- predict(encoder, finaluserdata.enc)
finaluserdata.encoded <- data.frame(finaluserdata.enc[,-which(names(finaluserdata.enc) %in% ignore.char)], dat.encoded)
head(finaluserdata.encoded)

Perform PCA to reduce the data to a more manageable number of variables. Start by looking at the percentage of total information contained within each of the new variables, known as Principal Components.

The first 15 components, ordered by information percentage, are shown.

pca <- princomp(finaluserdata.encoded)
info <- pca$sdev / sum(pca$sdev)
disp <- data.frame(paste(round(info*100,1), "%"))
names(disp) <- "Information Percentage by Component"
row.names(disp) <- names(info)
head(disp, 15)
##         Information Percentage by Component
## Comp.1                               89.8 %
## Comp.2                                5.1 %
## Comp.3                                1.9 %
## Comp.4                                1.8 %
## Comp.5                                0.4 %
## Comp.6                                0.2 %
## Comp.7                                0.1 %
## Comp.8                                0.1 %
## Comp.9                                0.1 %
## Comp.10                               0.1 %
## Comp.11                                 0 %
## Comp.12                                 0 %
## Comp.13                                 0 %
## Comp.14                                 0 %
## Comp.15                                 0 %

This highlights a significant reason why the User Churn models were not very effective. The 199 predictors evaluated by PCA could be nearly entirely replaced by a few dense variables. In other words, the data did not contain very much “information”.

info.cum <- cumsum(info)
disp <- data.frame(paste(round(info.cum*100,1), "%"))
names(disp) <- c("Cumulative Information Percentage by Component")
rownames(disp) <- names(info.cum)
head(disp, 15)
##         Cumulative Information Percentage by Component
## Comp.1                                          89.8 %
## Comp.2                                          94.9 %
## Comp.3                                          96.8 %
## Comp.4                                          98.6 %
## Comp.5                                            99 %
## Comp.6                                          99.2 %
## Comp.7                                          99.3 %
## Comp.8                                          99.4 %
## Comp.9                                          99.5 %
## Comp.10                                         99.6 %
## Comp.11                                         99.6 %
## Comp.12                                         99.6 %
## Comp.13                                         99.6 %
## Comp.14                                         99.6 %
## Comp.15                                         99.6 %

It appears only the first five components are needed. Use this information in Azure Machine Learning Studio to rerun the Initial Data Modeling step with a new dataset containing the five principal components plus the columns containing missing data, since those columns could still provide value. Here are the results:

Original Dataset (220 Predictors)

| Model Family | Hyper-parameters | Precision | Recall |
| --- | --- | --- | --- |
| Boosted Decision Tree | Leaves: 32 ~ Minimum Instances: 6 ~ Learning Rate: 0.252098531 ~ Trees: 270 | 57.1% | 16.5% |
| Boosted Decision Tree | Leaves: 54 ~ Minimum Instances: 19 ~ Learning Rate: 0.336396247 ~ Trees: 51 | 55.0% | 15.9% |
| Boosted Decision Tree | Leaves: 17 ~ Minimum Instances: 13 ~ Learning Rate: 0.06286619 ~ Trees: 50 | 54.5% | 12.6% |

PCA Dataset (12 Predictors + 5 Principal Components)

| Model Family | Hyper-parameters | Precision | Recall |
| --- | --- | --- | --- |
| Neural Network - Gaussian Normalizer | Learning Rate: 0.034618 ~ Loss Function: Cross Entropy ~ Iterations: 29 | 15.0% | 2.0% |
| Neural Network - Gaussian Normalizer | Learning Rate: 0.030355 ~ Loss Function: Cross Entropy ~ Iterations: 27 | 18.2% | 1.5% |
| Neural Network - Binning Normalizer | Learning Rate: 0.037861 ~ Loss Function: Cross Entropy ~ Iterations: 129 | 20.3% | 1.1% |

Using PCA on this dataset is not beneficial.
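Although PCA did not help this dataset, the component-selection step itself is easy to reproduce in base R with `prcomp`. The sketch below uses a hypothetical numeric matrix `X` as a stand-in for the prepared data (names and sizes are illustrative assumptions, not the ClubReady data):

```r
# Hypothetical numeric predictor matrix standing in for the prepared dataset
set.seed(42)
X <- matrix(rnorm(500 * 10), ncol = 10)
X[, 2] <- 3 * X[, 1] + rnorm(500, sd = 0.1)  # correlated column, so one component dominates

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by each component
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components explaining at least 99% of the variance
n.keep <- which(cumvar >= 0.99)[1]
X.reduced <- pca$x[, 1:n.keep, drop = FALSE]
```

The reduced matrix `X.reduced` is what would be passed downstream in place of the original predictors.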

Feature Selection

This section walks through the Feature Selection phase to determine which variables are important to the models we identified in the Initial Data Modeling phase.

  • The first of the two models, Stores, is examined in the first section.
  • In the following sections, the other model, Users, is explored, albeit with less explanatory content.

Stores

The Stores data is known to contain patterns that can predict whether a Store will churn in the next three months. One concern with the current models is that they leverage over 100 variables, which could be troublesome to operationalize. Therefore, the next step is to identify which of the variables are important to the model. Since a model has already been created, a technique known as permutation feature importance may be leveraged.
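Permutation feature importance is straightforward to sketch: for each variable, shuffle that column, re-score the model, and record how much the metric degrades. A minimal R version follows; the logistic model and synthetic data are illustrative assumptions, not the Azure Machine Learning module:

```r
set.seed(1)
# Synthetic stand-in data: the outcome depends on x1 but not on x2
dat <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
dat$y <- rbinom(400, 1, plogis(2 * dat$x1))

fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
accuracy <- function(model, d) mean((predict(model, d, type = "response") > 0.5) == d$y)

base.acc <- accuracy(fit, dat)
pfi <- sapply(c("x1", "x2"), function(v) {
    shuffled <- dat
    shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and the outcome
    base.acc - accuracy(fit, shuffled)      # drop in accuracy = importance
})
```

Shuffling `x1` costs the model real accuracy, while shuffling the noise column `x2` costs almost nothing, which is exactly the signal the PFI tables below capture.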

The Permutation Feature Importance module in Azure Machine Learning does not support the custom Precision * Recall metric. Therefore, run the module twice, once for Precision and once for Recall. Then determine which variables do not provide value for Precision or Recall. The results can be found in the Store Churn Data Modeling Results.xlsx file. Here is a summary:

| Variable Name | Precision PFI | Recall PFI |
| --- | --- | --- |
| TotalRevenue | 0.754623 | 0.153846 |
| TotalRevenue_Past3Months | 0.474503 | 0 |
| Status | 0.142292 | 0.061538 |
| Checkins_Evening | 0.100334 | 0 |
| Checkins_Hour6to7_Past3Months | 0.036232 | 0 |
| Checkins_Hour11to12_Past3Months | 0.036232 | 0 |
| ActiveUsers_Last3Months | 0.036232 | 0 |
| AverageMinsPerClass_Service | 0.036232 | 0 |
| Services | 0.036232 | 0 |
| Checkins_Hour8to9 | 0.023411 | 0 |
| Integration_Listen360 | 0.012422 | 0.030769 |
| Forms | 0.012422 | 0.030769 |
| Checkins_Monday | 0.005929 | 0.015385 |
| Checkins_Weekday | 0.005929 | 0.015385 |
| Checkins_Hour14to15 | 0.005929 | 0.015385 |
| Checkins_Morning | 0.005929 | 0.015385 |
| Checkins_Afternoon | 0.005929 | 0.015385 |
| Checkins_Monday_Past3Months | 0.005929 | 0.015385 |
| Checkins_Hour12to13_Past3Months | 0.005929 | 0.015385 |
| Checkins_Evening_Past3Months | 0.005929 | 0.015385 |
| ActiveEmployees_Last3Months | 0.005929 | 0.015385 |
| UserEmployeeRatio_Last3Months | 0.005929 | 0.015385 |
| ActiveEmployees | 0 | 0.015385 |

Any variables not included did not show importance for either metric. Note that the zeroes are not true zeroes - they represent very small numbers. When the variable set is reduced to only these variables, the metrics degrade, as seen below.

Full Variable Set

| Test/Training Set | Precision | Recall | Precision * Recall |
| --- | --- | --- | --- |
| Set 1 | 0.893 | 0.384 | 0.343 |

Precision/Recall Variable Set

| Test/Training Set | Precision | Recall | Precision * Recall |
| --- | --- | --- | --- |
| Set 1 | 0.815 | 0.336 | 0.27384 |
| Set 2 | 0.767 | 0.287 | 0.220129 |
| Set 3 | 0.84 | 0.309 | 0.25956 |

Another analysis was performed including other evaluation metrics, Accuracy and Average Log Loss.

| Variable Name | Precision PFI | Recall PFI | Accuracy PFI | Average Log Loss PFI |
| --- | --- | --- | --- | --- |
| TotalRevenue | 0.754623 | 0.153846 | 0.041257 | 0.345677 |
| TotalRevenue_Past3Months | 0.474503 | 0 | 0.016699 | 0.032497 |
| Status | 0.142292 | 0.061538 | 0.003438 | 0.08103 |
| Checkins_Evening | 0.100334 | 0 | 0.001473 | 0 |
| Checkins_Hour6to7_Past3Months | 0.036232 | 0 | 0.000491 | 0 |
| Checkins_Hour11to12_Past3Months | 0.036232 | 0 | 0.000491 | 0 |
| ActiveUsers_Last3Months | 0.036232 | 0 | 0.000491 | 0.007357 |
| AverageMinsPerClass_Service | 0.036232 | 0 | 0.000491 | 0.001426 |
| Services | 0.036232 | 0 | 0.000491 | 0 |
| Checkins_Hour8to9 | 0.023411 | 0 | 0 | 0 |
| Integration_Listen360 | 0.012422 | 0.030769 | 0.000982 | 0.005943 |
| Forms | 0.012422 | 0.030769 | 0.000982 | 0.004854 |
| Checkins_Monday | 0.005929 | 0.015385 | 0.000491 | 0.005731 |
| Checkins_Weekday | 0.005929 | 0.015385 | 0.000491 | 0.002713 |
| Checkins_Hour14to15 | 0.005929 | 0.015385 | 0.000491 | 0.003422 |
| Checkins_Morning | 0.005929 | 0.015385 | 0.000491 | 0.003546 |
| Checkins_Afternoon | 0.005929 | 0.015385 | 0.000491 | 0.002234 |
| Checkins_Monday_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0.000259 |
| Checkins_Hour12to13_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| Checkins_Evening_Past3Months | 0.005929 | 0.015385 | 0.000491 | 0.00133 |
| ActiveEmployees_Last3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| UserEmployeeRatio_Last3Months | 0.005929 | 0.015385 | 0.000491 | 0 |
| ActiveEmployees | 0 | 0.015385 | 0 | 0.000637 |
| PurchasedSessions | 0 | 0 | 0 | 0.009245 |
| Amenities | 0 | 0 | 0 | 0.007819 |
| StoreAgeMonths | 0 | 0 | 0 | 0.003792 |
| UserEmployeeRatio | 0 | 0 | 0 | 0.002884 |
| Checkins_Thursday_Past3Months | 0 | 0 | 0 | 0.002714 |
| Checkins | 0 | 0 | 0 | 0.002576 |
| Checkins_Hour4to5 | 0 | 0 | 0 | 0.00185 |
| Checkins_Past3Months | 0 | 0 | 0 | 0.001458 |
| Checkins_Hour9to10 | 0 | 0 | 0 | 0.001359 |
| AveragePricePerService | 0 | 0 | 0 | 0.000855 |
| AveragePricePerClass_Service | 0 | 0 | 0 | 0.000808 |
| Checkins_Hour6to7 | 0 | 0 | 0 | 0.000788 |
| Checkins_Hour1to2_Past3Months | 0 | 0 | 0 | 0.000597 |
| Checkins_Hour13to14 | 0 | 0 | 0 | 0.000224 |
| NonRequiredWaiver | 0 | 0 | 0 | 0.000083 |
| Checkins_Hour17to18 | 0 | 0 | 0 | 0.000071 |
| Checkins_Saturday | 0 | 0 | 0 | 0.000013 |

When all of these variables are used, performance improves, moving closer to that of the original model.

Full Variable Set (130 Variables)

| Test/Training Set | Precision | Recall | Precision * Recall |
| --- | --- | --- | --- |
| Set 1 | 0.893 | 0.384 | 0.343 |

Precision/Recall Variable Set (23 Variables)

| Test/Training Set | Precision | Recall | Precision * Recall |
| --- | --- | --- | --- |
| Set 1 | 0.815 | 0.336 | 0.27384 |
| Set 2 | 0.767 | 0.287 | 0.220129 |
| Set 3 | 0.84 | 0.309 | 0.25956 |

Precision/Recall/Accuracy/Average Log Loss Variable Set (40 Variables)

| Test/Training Set | Precision | Recall | Precision * Recall |
| --- | --- | --- | --- |
| Set 1 | 0.88 | 0.338 | 0.29744 |
| Set 2 | 0.889 | 0.3 | 0.2667 |
| Set 3 | 0.96 | 0.353 | 0.33888 |

This looks promising. Given the Users data contains all of these Store variables, we will wait until we see the Permutation Feature Importance results for the User Churn scenario before we decide which variables we will keep.

Survival Analysis

Introduction

In the Initial Data Modeling phase for User Churn, the data set does not contain sufficient information for Classification. However, there are other approaches to this type of problem. One such alternative is Survival Analysis.

In Survival Analysis, the goal is to predict Time to Failure. To do this, a Start Time and an End Time must be defined. The End Time is obvious: it is the first day of the Churn Month for that User. For simplicity, the first day of the Activation Month for that User was chosen as the Start Time.

In this Survival Analysis, the Cox Proportional Hazards model is used. This model accepts a number of numeric variables for each combination of month and user and predicts how many months until the User is expected to churn.

Data Preparation

To prepare the data for the Cox Proportional Hazards model, the dataset must be changed slightly. Instead of using the IsChurn flag, create a new variable called MonthsUntilChurn. The SQL for creating this dataset can be found in the [ML].[SA_CompleteUserData] stored procedure in the [Reports] database.
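As a rough illustration of the target the stored procedure produces, MonthsUntilChurn is the number of whole months between the first day of the Activation Month and the first day of the Churn Month. The dates and variable names below are hypothetical, for illustration only:

```r
# Hypothetical activation/churn month start dates for three users
activation <- as.Date(c("2016-01-01", "2016-06-01", "2017-03-01"))
churn      <- as.Date(c("2016-04-01", "2017-06-01", "2017-03-01"))

# Whole months between the activation month and the churn month
months.between <- function(start, end) {
    12L * (as.integer(format(end, "%Y")) - as.integer(format(start, "%Y"))) +
        (as.integer(format(end, "%m")) - as.integer(format(start, "%m")))
}
MonthsUntilChurn <- months.between(activation, churn)
```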

Below is the survival curve for the sampled dataset.

sauserdata <- read_csv("C:/Users/cweaver/Downloads/Users_SA_1M_2017_12_29.csv", col_types = cols(
    Store_UserEmployeeRatio = col_double(),
    Store_UserEmployeeRatio_Last3Months = col_double(),
    Store_AverageMinsPerClass_Service = col_double(),
    Store_AveragePricePerClass_Service = col_double(),
    Store_AverageCancellationHrsPerClass_Service = col_double(),
    Store_AverageRescheduleDeadlinePerClass_Service = col_double(),
    Store_AverageMinsPerClass = col_double(),
    Store_AveragePricePerClass = col_double(),
    Store_AverageCancellationHrsPerClass = col_double(),
    Store_AverageRescheduleDeadlinePerClass = col_double(),
    Store_AverageMinsPerService = col_double(),
    Store_AveragePricePerService = col_double(),
    Store_AverageCancellationHrsPerService = col_double(),
    Store_AverageRescheduleDeadlinePerService = col_double())
    ,progress = FALSE
)

sa <- Surv(sauserdata$MonthsUntilChurn)
sa.fit <- survfit(sa ~ 1)
plot(sa.fit, main = "User Churn Survival Curve", xlab = "Months Active", ylab = "Proportion of Users Still Active")

Following the same process explored previously, remove non-predictive variables -

ignore.set <- c("UserId", "StoreId", "ReferenceMonthStartDate","IsChurn", "IsChurn_2Months", "IsChurn_3Months", "IsChurn_6Months", "MonthsUntilChurn_RC")
disp <- data.frame(ignore.set)
names(disp) <- "Variables Ignored for Business Reasons"
disp
##   Variables Ignored for Business Reasons
## 1                                 UserId
## 2                                StoreId
## 3                ReferenceMonthStartDate
## 4                                IsChurn
## 5                        IsChurn_2Months
## 6                        IsChurn_3Months
## 7                        IsChurn_6Months
## 8                    MonthsUntilChurn_RC

Remove any variables with missing values -

sauserdata.miss <- sauserdata[,-which(names(sauserdata) %in% ignore.set)]
myNumSum <- numSummary(sauserdata.miss)[, c(1,7,8,16,17)]
myNumSum <- tibble::rownames_to_column(myNumSum)
names(myNumSum)[5] <- "missCNT"
names(myNumSum)[1] <- "Variable_Name"
myNumSum <- arrange(myNumSum, desc(n))
ignore.miss <- myNumSum[myNumSum[,5] > 0,1]
disp <- data.frame(ignore.miss)
names(disp) <- c("Variables with Missing Values")
disp
##                      Variables with Missing Values
## 1              Store_UserEmployeeRatio_Last3Months
## 2                          Store_UserEmployeeRatio
## 3                Store_AverageMinsPerClass_Service
## 4               Store_AveragePricePerClass_Service
## 5     Store_AverageCancellationHrsPerClass_Service
## 6  Store_AverageRescheduleDeadlinePerClass_Service
## 7                     Store_AveragePricePerService
## 8           Store_AverageCancellationHrsPerService
## 9        Store_AverageRescheduleDeadlinePerService
## 10                     Store_AverageMinsPerService
## 11                       Store_AverageMinsPerClass
## 12                      Store_AveragePricePerClass
## 13            Store_AverageCancellationHrsPerClass
## 14         Store_AverageRescheduleDeadlinePerClass

Remove variables with no variance -

sauserdata.var <- sauserdata.miss[,-which(names(sauserdata.miss) %in% ignore.miss)]
ignore.var <- names(which(round(apply(sauserdata.var, MARGIN = 2, FUN = var), 4)<= 0))
disp <- data.frame(ignore.var)
names(disp) <- "Variables with No Variance"
disp
##                      Variables with No Variance
## 1                              NonRequiredForms
## 2                             NonRequiredWaiver
## 3                      NonRequiredHealthHistory
## 4                         RequiredHealthHistory
## 5                              Store_SMSEnabled
## 6                      Store_Integration_FitBPO
## 7                  Store_Integration_MotionSoft
## 8                   Store_Integration_Paramount
## 9                Store_Integration_ShareYourFit
## 10       Store_Integration_VisualFitnessPlanner
## 11 Store_IntegrationType_LandingPages_Marketing
## 12            Store_IntegrationType_SocialMedia

Perform One-Hot Processing to change any remaining categorical variables into numeric ones. (Again, the results are excessive and not displayed.)

sauserdata.enc <- sauserdata.var[,-which(names(sauserdata.var) %in% ignore.var)]
ignore.char <- row.names(charSummary(sauserdata.enc))
encoder <- onehot(sauserdata.enc[,which(names(sauserdata.enc) %in% ignore.char)], stringsAsFactors = TRUE)
dat.encoded <- predict(encoder, sauserdata.enc)
sauserdata.encoded <- data.frame(sauserdata.enc[,-which(names(sauserdata.enc) %in% ignore.char)], dat.encoded)
glimpse(sauserdata.encoded)

Create our testing and training sets -

ind.train <- sample(seq_len(dim(sauserdata.encoded)[1]), floor(dim(sauserdata.encoded)[1] * .7))
sauserdata.train <- sauserdata.encoded[ind.train,]
sauserdata.test <- sauserdata.encoded[-ind.train,]
disp <- as.data.frame(matrix(c(dim(sauserdata.train),dim(sauserdata.test)),ncol=2,byrow=FALSE))
names(disp) <- c("Training set", "Testing Set")
row.names(disp) <- c("Number of Records", "Number of Variables")
disp
##                     Training set Testing Set
## Number of Records         626721      268596
## Number of Variables          200         200

Cox Proportional Hazards Model

Train the survival analysis predictive model.

sa <- Surv(sauserdata.train$MonthsUntilChurn)
cox.form <- formula(paste("sa ~ ", paste(names(sauserdata.train), collapse = " + "), " - MonthsUntilChurn"))
cox.fit <- coxph(cox.form, data = sauserdata.train)

Survival Analysis differs from Classification: instead of binomial predictions, the output is a set of survival curves. Here is a survival curve for one user.

cox.test <- survfit(cox.fit, newdata = sauserdata.test)$surv
plot(cox.test[,1], main = "Survival Curve for User", xlab = "Months Active", ylab = "Probability of Survival", type = "l")
abline(a=.5, b=0, col = "red")

The curve is used to determine when the User is likely to churn. In other words, when is the probability of the user churning greater than the probability of the user not churning - when does the probability of survival drop below 0.5?

pred <- function(x = c(), thresh = .5){
    # Return the index (month) of the first point at which the survival
    # probability drops to or below the threshold
    y <- order(x <= thresh, x, decreasing = c(TRUE, TRUE))[1]
    return(y)
}
sapred <- apply(cox.test, MARGIN = 2, FUN = pred)
saeval <- data.frame(sauserdata.test$MonthsUntilChurn, sapred)
names(saeval) <- c("Actual", "Predicted")
head(saeval)
##    Actual Predicted
## 2       1         1
## 7       4         1
## 8       1         1
## 11     18         1
## 22     13         1
## 26     15         1

The survival analysis for User Churn did not provide adequate performance to proceed. Valorem submits this as further evidence that the data for User Churn does not support an algorithmic solution. More user-centric data is needed.

NEW Data Analysis

The information below was not part of the original project SOW. It was created as a natural extension of the Digital Insights Workshop to provide a glimpse of the additional data insights a modern data platform can deliver.

Introduction

In the User Data Modeling phase, it was discovered that the User Churn data does not contain patterns or signals strong enough to create a useful algorithmic model. This leads to the question: what data is needed to develop a user churn predictive model?

Conclusion

Valorem created a Power BI demonstration to help ClubReady answer this question. The demonstration solution is available for review in the Users Prototype Analysis.pbix file. This was created to introduce ClubReady to the opportunities for data exploration using the simplicity - and power - of Power BI.

The analysis that follows can be reproduced using the PBIX file.

Data

The variables in the [ClubReady].[dbo].[Users], [UserAmenityHistory], [Tags] and [ContractPurchases] tables were evaluated to determine which were sufficiently populated and contained information with potentially useful analytic value.

The SQL code can be found in the [PBI].[Users] view within the [Reports] database.

Activations

Activations exceeding Churn is desirable because it means ClubReady is gaining users. This is the case for the majority of the time frames shown. [ChurnMonthStartDate] is a potential predictor for User Churn.

Here are a few more trends found by using Power BI:

Binary Users Variables

[CheckinCredentialsModified], [HasDoctor] and [HasAltPhone] variables are potentially useful. However, further inspection shows that [HasDoctor] and [HasAltPhone] variables are too rarely populated to be of use. This leaves [CheckinCredentialsModified].

User Age

Evaluate the User Age variables, AgeAtActivation and AgeAtChurn.

Churn and Activations seem to follow a similar trend with respect to age. Explore further by examining Churn % by Age.

There is a clear trend between Churn and AgeAtActivation.

Months Active

Does this same relationship hold true for Length of Membership?

The top histogram shows a very common problem with certain numeric data, such as date-based and monetary data. This data is generally very right-skewed. A way is needed to create bins that are not equally sized. An easy way to do this is with the logarithm transform. The bottom histogram shows how this creates a much more usable visualization. There’s obviously a relationship between Total Churn and Length of Membership. Now determine if this holds true for Churn %.
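The log-binning trick described above is a one-liner in R. The sketch below uses synthetic right-skewed data (an assumption for illustration, not the actual membership data) to show why equal-width bins fail on the raw scale:

```r
set.seed(7)
# Synthetic right-skewed "months active" values
months.active <- rexp(1000, rate = 0.1) + 1

# Equal-width bins on the raw scale pile most users into the first bar;
# binning log(x) instead yields unequally sized, far more readable buckets
raw.bins <- cut(months.active, breaks = 10)
log.bins <- cut(log(months.active), breaks = 10)

# Share of users in the single largest bin, raw vs. log scale
raw.peak <- max(table(raw.bins)) / length(months.active)
log.peak <- max(table(log.bins)) / length(months.active)
```

The largest raw-scale bin swallows a far bigger share of users than the largest log-scale bin, which is why the bottom histogram in the report is so much more usable.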

Another interesting result! User Churn within the first year is extremely high. However, this drops dramatically as they remain active for subsequent years. [MonthsActive] is also a potentially powerful predictor.

Geography

Evaluate User Address on Churn.

Looking at the charts on the right, it is easy to see that State, Zip and City show high variability in User Churn, which could make these fields useful for modeling. However, it is important to note the size of these variables in the charts on the left. Given the number of distinct values in these variables, using them directly is unlikely to add much power. Instead, this may be a use case for some type of count-based or ratio-based variables. Another thing to consider is that the City names seem very unreliable; we are hesitant to call these “high-quality” predictors without being able to subject them to data cleansing.

Referral Type and Prospect Type

The final two variables from the Users table are [Referral Type] and [Prospect Type].

Similar to the geographic fields, [Referral Type] and [Prospect Type] show high variability for User Churn. However, their high granularity leads us to think that count-based or ratio-based variables would be more beneficial.
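A ratio-based variable of the kind suggested here replaces each high-cardinality level with its observed churn rate. A minimal R sketch follows; the ReferralType values and churn flags are made up for illustration:

```r
# Hypothetical user-level data: a high-cardinality category plus a churn flag
users <- data.frame(
    ReferralType = c("Web", "Web", "Friend", "Friend", "Friend", "Flyer"),
    IsChurn      = c(1, 0, 0, 0, 1, 1)
)

# Observed churn rate per level, joined back as a numeric predictor
rates <- aggregate(IsChurn ~ ReferralType, data = users, FUN = mean)
names(rates)[2] <- "ReferralType_ChurnRate"
users <- merge(users, rates, by = "ReferralType")
```

In practice the rates should be computed on the training set only (and smoothed for rare levels) to avoid leaking the target into the predictor.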

Amenity

The next dataset we want to examine is UserAmenityHistory. This dataset contains all of the Amenities connected to each user. This dataset is quite small and only contains one field of interest, [AmenityName]. Examine the impact Amenity selection has on User Churn.

Amenity selection creates variability for User Churn. Use Power BI’s built-in interactions to see how this plays out within particular stores.

It is reasonable to conclude Amenity is a potential predictor for User Churn, especially in conjunction with Store. However, given its high granularity, some type of count-based or ratio-based approach may need to be considered.

Tag

Similar to Amenities, this dataset contains all of the Tags associated to each user.

Tags tell a very similar story to Amenities. However, given the free-form nature of the field and how rarely it is used, it seems unlikely much insight can be derived. This is supported by the fact that the “Tags” field was present in the Store Churn dataset and was discarded by the Feature Selection process.

ContractPurchases

Evaluate the ContractPurchases dataset, which contains the membership-related purchases for each customer.

Binary ContractPurchases Variables

[HasRequiredSignatures], [ElectronicSignature] and [Membership] all show variability for User Churn. The bottom charts show these variables are well-populated. The same is not true for [CommissionsOk] and [SMSCheck]. The well-populated variables overlap heavily:

  • HasRequiredSignature = TRUE : ElectronicSignature
  • HasRequiredSignature = TRUE : Membership
  • ElectronicSignature = TRUE : Membership

It appears [HasRequiredSignatures], [ElectronicSignature] and [Membership] fields all contain similar information. This means they are potentially good predictors.

Purchased Sessions

The major component of the ContractPurchases dataset is [PurchasedSessions]. This informs how active a user is.

Because the PurchasedSessions distribution is right-skewed, a logarithm transformation is applied.

Interesting. One way to create useful variables out of counts is to create “buckets”, also known as “discretization”. This chart tells us that we could create the following buckets for [PurchasedSessions], possibly leading to a useful predictor.

  • 0 PurchasedSessions
  • 1 PurchasedSession
  • 2 PurchasedSessions
  • 3 PurchasedSessions
  • 4 - 40 PurchasedSessions
  • More than 40 PurchasedSessions
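The bucketing proposed above maps directly onto R's `cut` (the `sessions` vector is a hypothetical sample of purchased-session counts):

```r
# Hypothetical purchased-session counts covering every proposed bucket
sessions <- c(0, 1, 2, 3, 4, 15, 40, 41, 100)

# Buckets: 0, 1, 2, 3, 4-40, >40 (right-closed intervals)
buckets <- cut(sessions,
               breaks = c(-Inf, 0, 1, 2, 3, 40, Inf),
               labels = c("0", "1", "2", "3", "4-40", ">40"))
```

The resulting factor can be fed to a model in place of, or alongside, the raw count.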

Review Purchased Sessions over Time to see if there is anything useful.

Curiously, [PurchasedSessions] follows the same chronological trend as Activations, as opposed to Churn. This is not helpful for modeling, but it does suggest further exploration may be warranted. Additionally, the scatterplot shows [PurchasedSessions] was positively correlated with Churn in 2016 but negatively in 2017. Again, this is not helpful for modeling, but it is informative and thought-provoking.

Including [PurchasedSessions] and a discretized version of it in the User Churn dataset would likely help create a better model.

Electronic Signatures Taken

The final variable in the ContractPurchases dataset is [ElectronicSignaturesTaken]. This variable records the number of signatures required for the user to purchase the contract.

There appears to be a weak correlation between [ElectronicSignaturesTaken] and [Churn %]. However, the number of users in each of the segments greater than 0 is small. Therefore, it may also be beneficial to include a binary version of this variable, [ElectronicSignaturesTaken>0]. Theoretically, this should be the same as the [ElectronicSignature] variable. However, there is a small data quality issue that causes them to be different. It might be worthwhile to investigate this at a later date.

Results

Below are all the fields identified in this section as potentially useful for predicting User Churn. These can be found in the User Churn Data Analysis Results.xlsx file.

| Database | Schema | Table | Field | Final Table | Final Field | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Reports | ML | UserChurn | All | Users | | |
| ClubReady | dbo | Users | CheckinCredentialsModified | Users | CheckinCredentialsModified | |
| ClubReady | dbo | Users | EmergencyContactName | Users | HasEmergencyContact | |
| ClubReady | dbo | Users | EmergencyContactPhone | Users | HasEmergencyContact | |
| ClubReady | dbo | Users | Phone | Users | HasPhone | |
| ClubReady | dbo | Users | CellPhone | Users | HasCellPhone | |
| ClubReady | dbo | ContractPurchases | ActivationDateUtc | Users | AgeAtActivation | |
| ClubReady | dbo | ContractPurchases | AgreedDate | Users | AgeAtActivation | |
| ClubReady | dbo | Users | dob | Users | AgeAtActivation | |
| ClubReady | dbo | Users | (Reused Field) | Users | MonthsActive | |
| ClubReady | dbo | Users | State | Users | State | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | Zip | Users | Zip | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | City | Users | City | Needs Count-Based or Ratio-Based Approach; Needs Data Cleansing |
| ClubReady | dbo | Users | ReferralTypeId | Users | Referral Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ReferralType | ReferralType | Users | Referral Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Users | ProspectTypeId | Users | Prospect Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ProspectType | ProspectType | Users | Prospect Type | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | Amenities | AmenityName | UserAmenityHistory | Amenity | Needs Count-Based or Ratio-Based Approach |
| ClubReady | dbo | ContractPurchases | HasRequiredSignatures | ContractPurchases | HasRequiredSignatures | |
| ClubReady | dbo | ContractPurchases | ElectronicSignature | ContractPurchases | ElectronicSignature | |
| ClubReady | dbo | ContractPurchases | Membership | ContractPurchases | Membership | |
| ClubReady | dbo | ContractPurchases | TotalSessions | ContractPurchases | PurchasedSessions | Discretize to [0, 1, 2, 3, 4-40, >40] |
| ClubReady | dbo | ContractPurchases | ElectronicSignaturesTaken | ContractPurchases | ElectronicSignaturesTaken | Discretize to [0, >0] |

Further Analysis

The tables that could contain potentially useful variables for User Churn are presented below:

  • [ClubReady].[dbo].[PurchaseLog]
  • [ClubReady].[dbo].[Stores]
  • [ClubReady].[dbo].[Checkinlog_v]
  • [ClubReady].[dbo].[UserWaiverRequirement]
  • [ClubReady].[dbo].[DeclineLog]
  • [ClubReady].[dbo].[bookings]
  • [ClubReady].[dbo].[loginlog]

Conclusions & Next Steps

Throughout the Digital Insights Workshop, Valorem and ClubReady collaborated to create a data-driven journey designed to reshape ClubReady into a data-first organization.

Store Churn: A prototype “Store Churn” algorithm that predicts when a Store is likely to go out of business in the next three months was created. This is accessible via an Azure Machine Learning Web Service. It is recommended ClubReady analyze the results of this algorithm over the next few months to determine its real-world accuracy. If successful, ClubReady may incorporate the algorithm in its application or use it as a standalone utility. If the algorithm fails to meet expected performance, a targeted analysis could be undertaken to determine its deficiencies and identify corrective action.

User Churn: Less success was realized in developing a User Churn algorithm. The data analysis and the modeling strongly suggest the data does not contain sufficiently rich user-behavioral information to build an algorithmic predictive solution. This was discovered during the user churn modeling and confirmed through PCA analysis. Perhaps in the future ClubReady applications will capture more user behavioral data so a user churn model can be developed.

Future Data Exploration: The data exploration, the associated R code and the resulting Power BI Desktop file can be used as a guide to design a full-scale analytic platform. This gives ClubReady the opportunity to interactively explore what stories the data might tell, unlocking the ability not only to answer questions, but also to understand which questions have yet to be asked. This contributes to faster, more accurate and more efficient reporting. It is a critical step to becoming a data-first organization.

Recommendation - Transaction Platform Redesign: ClubReady uses a single Azure IaaS SQL Database to power its application. This is the product of years of agile development and rapidly implemented business features - often customer-specific changes. With the availability of newer technologies, such as Data Lake, this is an opportune time to consider a redesign of the existing system. This will open up new opportunities for more agile and/or continuous integration and development. It will also strengthen ClubReady’s ability to perform routine and ad hoc analyses as part of its Data Analytics journey. This builds the environment where machine learning can become the fabric stitching the data and evolving business requirements into an industry-leading data-first platform.